All branches on Terra that have been mentioned in this documentation will be deleted. Please use the v2.3.0 version of TheiaProk moving forward.
+
The main branch of tbp-parser will host v2.1.0 and above; earlier versions of tbp-parser will no longer be supported.
+
Future releases of tbp-parser will only support outputs generated by TBProfiler v6.0.0 and above.
+
+
Versions of TBProfiler prior to v6.0.0 are not compatible with v2+ of tbp-parser. Please ensure that you are using the correct version of tbp-parser for your version of TBProfiler.
tbp-parser is a tool developed in partnership with the California Department of Health (CDPH) to parse the output of Jody Phelan’s TBProfiler tool into four additional files:
tbp-parser is designed to run immediately after Jody Phelan’s TB-Profiler tool. Only two inputs are required: the JSON file produced by TB-Profiler and the BAM file produced by TB-Profiler.
+
tbp-parser is designed to run immediately after Jody Phelan’s TBProfiler tool. Only two inputs are required: the JSON file produced by TBProfiler and the BAM file produced by TBProfiler.
The JSON file contains information about the mutations detected in the sample: the quality, the type, and if that mutation confers resistance to an antimicrobial drug. The BAM file contains the alignment information for the sample and is needed for determining sequencing quality.
The path to the JSON file that was produced by TB-Profiler
+
The path to the JSON file that was produced by TBProfiler
input_bam
-
The path to the BAM file that was produced by TB-Profiler
+
The path to the BAM file that was produced by TBProfiler
-
Info
+
BAM index file required
The BAM file must have the accompanying BAI file in the same directory. It must also be named exactly the same as the BAM file but ending with a .bai suffix.
These options are primarily used for tNGS data, although all frequency arguments are compatible with WGS data.
+
These options are primarily used for tNGS data, although all frequency and read support arguments are compatible with WGS data.
@@ -1182,7 +1182,7 @@
tNGS-specific Arguments
--tngs_expert_regions
A BED file containing the regions to calculate coverage for expert rule regions. This is used to determine coverage quality in the regions where resistance-conferring mutations are found, or where a CDC expert rule is applied. This is not used for QC purposes
These options change the verbosity of the stdout log
-| Name | Description | Default Value |
-| :--- | :---------- | :------------ |
-| --verbose | Increases the output verbosity to describe which stage of the analysis is currently running | false |
-| --debug | The highest level of output verbosity detailing every step of the analysis and logic implemented; overwrites --verbose | false |
+
These options change the verbosity of the stdout log
+
+
+
+
Name
+
Description
+
Default Value
+
+
+
+
+
--verbose
+
Increases the output verbosity to describe which stage of the analysis is currently running
+
false
+
+
+
--debug
+
The highest level of output verbosity detailing every step of the analysis and logic implemented; overwrites --verbose
When running tbp-parser as part of the TheiaProk workflow series (find documentation for TheiaProk here) on Terra.bio, an optional input must be activated to instruct TheiaProk to run tbp-parser.
tbp-parser is not on by default due to the nature of this tool and its outputs.
TheiaProk Version
-
This information only corresponds to PHB v2.2.0. These inputs and outputs may not be applicable to other versions of TheiaProk.
+
This information only corresponds to the upcoming PHB v2.3.0 release. These inputs and outputs may not be applicable to other versions of TheiaProk.
CAUTION: The information produced by this program should not be used for clinical reporting unless and until extensive validation has occurred in your laboratory on a stable version. Otherwise, the outputs of tbp-parser are for research use only.
tbp-parser is a tool developed in partnership with the California Department of Health (CDPH) to parse the output of Jody Phelan\u2019s TBProfiler tool into four additional files:
A Laboratorian report, which contains information about each mutation detected and its associated drug resistance profile in a CSV file.
A LIMS report, formatted specifically for CDPH\u2019s STAR LIMS, which summarizes the highest severity mutations for each antimicrobial drug and the relevant mutations.
A Looker report, which condenses the information contained in the Laboratorian report into a format suitable for generating a dashboard in Google\u2019s Looker Studio.
A coverage report, which contains the percent coverage of each gene relative to the H37Rv reference genome in addition to any warnings, such as any deletions identified in the gene that might have contributed to a reduced percent coverage
Please reach out to us at support@theiagen.com if you would like any custom file formats and/or changes to these output files that suit your individual needs.
We host our Docker images on the Google Artifact Registry so that they are always available for use.
The entrypoint for this Docker image is the tbp-parser help message. To run this container interactively, use the following command:
docker run -it --entrypoint=/bin/bash us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:1.6.0\n# Once inside the container interactively, you can run the tbp-parser tool\npython3 /tbp-parser/tbp_parser/tbp_parser.py -v\n# v1.6.0\n
"},{"location":"usage/#locally-with-python","title":"Locally with Python","text":"
tbp-parser is not yet available with pip or conda. To run tbp-parser in your local command-line environment, install the following dependencies:
python3
pandas >= 1.4.2
importlib_resources
samtools
After installation of these dependencies, download and extract the latest release of tbp-parser and run the script with Python.
Please note that the BAM file must have the accompanying BAI file in the same directory. It must also be named exactly the same as the BAM file but ending with a .bai suffix.
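The BAM index requirement can be checked before launching tbp-parser. The following is a minimal sketch, not part of tbp-parser itself. It assumes that "named exactly the same as the BAM file but ending with a .bai suffix" means the index is the BAM file name with `.bai` appended (for example, `sample.bam.bai`); the helper names `expected_index_path` and `has_index` are hypothetical.

```python
from pathlib import Path


def expected_index_path(bam_path):
    """Return the index path expected next to a BAM file.

    Assumption: the index is the BAM file name with ".bai" appended,
    e.g. sample.bam -> sample.bam.bai, in the same directory.
    """
    bam = Path(bam_path)
    return bam.with_name(bam.name + ".bai")


def has_index(bam_path):
    """Return True if the expected index file exists on disk."""
    return expected_index_path(bam_path).is_file()
```

If your index uses a different naming pattern, adjust `expected_index_path` accordingly before relying on the check.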
The help message printed by tbp-parser is quite extensive, but has a lot of useful information regarding the input parameters. Here is the entire message in full. You can find more information regarding these inputs in the Inputs section.
usage: python3 /tbp-parser/tbp_parser/tbp_parser.py [-h|-v] <input_json> <input_bam> [<args>]\n\nParses Jody Phelon's TB-Profiler JSON output into four files:\n- a Laboratorian report,\n- a LIMS report\n- a Looker report, and\n- a coverage report\n\npositional arguments:\n input_json\n the JSON file produced by TBProfiler\n input_bam\n the BAM file produced by TBProfiler\n\noptional arguments:\n -h, --help\n show this help message and exit\n -v, --version\n show program's version number and exit\n\nquality control arguments:\n options that determine what passes QC\n\n -d, --min_depth\n the minimum depth of coverage for a site to pass QC\n default=10\n -c, --min_percent_coverage\n the minimum percentage of a region that has depth above the threshold set by min_depth\n (used for a gene/locus to pass QC)\n default=100\n -s, --min_read_support\n the minimum read support for a mutation to pass QC\n default=10\n -f, --min_frequency\n the minimum frequency for a mutation to pass QC (0.1 -> 10%)\n default=0.1\n -r, --coverage_regions\n the BED file containing the regions to calculate percent coverage for\n default=data/tbdb-modified-regions.bed\n\ntext arguments:\n arguments that are used verbatim in the reports or to name the output files\n\n -m, --sequencing_method\n the sequencing method used to generate the data; used in the LIMS & Looker reports\n ** Enclose in quotes if includes a space\n default=\"Sequencing method not provided\"\n -p, --operator\n the operator who ran the sequencing; used in the LIMS & Looker reports\n ** Enclose in quotes if includes a space\n default=\"Operator not provided\"\n -o, --output_prefix\n the output file name prefix\n ** Do not include any spaces\n\ntNGS-specific arguments:\n options that are primarily used for tNGS data\n (all frequency arguments are compatible with WGS data)\n\n --tngs\n indicates that the input data was generated using Deeplex + CDPH modified protocol\n Turns on tNGS-specific global parameters\n 
--tngs_expert_regions\n the BED file containing the regions to calculate coverage for expert rule regions\n (used to determine coverage quality in the regions where resistance-conferring\n mutations are found, or where a CDC expert rule is applied; not for QC)\n default=data/tngs-expert-rule-regions.bed\n --rrs_frequency\n the minimum frequency for an rrs mutation to pass QC\n (rrs has several problematic sites in the Deeplex tNGS assay)\n default=0.1\n --rrl_frequency\n the minimum frequency for an rrl mutation to pass QC\n (rrl has several problematic sites in the Deeplex tNGS assay)\n default=0.1\n --rpob449_frequency\n the minimum frequency for an rpoB mutation at protein position 449 to pass QC\n (this is a problematic site in the Deeplex tNGS assay)\n default=0.1\n --etha237_frequency\n the minimum frequency for an ethA mutation at protein position 237 to pass QC\n (this is a problematic site in the Deeplex tNGS assay)\n default=0.1\n\nlogging arguments:\n options that change the verbosity of the stdout log\n\n --verbose\n increase output verbosity\n --debug\n increase output verbosity to debug; overwrites --verbose\n\nPlease contact support@theiagen.com with any questions\n
Resistance calls are made in either one of two ways. The first is using the WHO annotation, which is output directly from the TBProfiler. The WHO has a catalogue of mutations and how they may confer antimicrobial resistance. If this annotation is present, it will always be used.
In the case where the WHO annotation is missing, either due to novel mutations or mutations with unclear significance in the literature, tbp-parser will apply expert rules. These expert rules are additional conditions used to decide if a mutation is considered to confer resistance or not. These expert rules come from the CDC and can be found documented in the tbp-parser GitHub repository inside the interpretation logic PDFs.
When an expert rule is applied, the rationale field of the laboratorian report will indicate which expert rule was used (the number prefacing the rule directly correlates to the appropriate section in the interpretation logic PDF) and indicate that there was no WHO annotation.
The interpretation documents for v1.2.2 and v1.4.4.8 are available in the root directory of the tbp-parser repository. Versions that correspond to different releases are available in the interpretation_docs directory on GitHub.
tbp-parser is object-oriented, with each class representing either an output file, a part of an output file, or a part of the input JSON file produced by TBProfiler.
The first class invoked by the tbp-parser.py script is Parser, a control class that orchestrates the creation of the different output reports.
Before creating any reports, Parser calls the Coverage class to calculate the percent gene coverage over a specified minimum depth (default: 10) for the coding regions of all genes included in the TBDB (the database used in TBProfiler to generate the drug resistance annotations). This requires as input the BAM and BAI files produced by TBProfiler during alignment to the H37Rv reference genome. The percent gene coverage results are then stored in a global dictionary that is accessed multiple times for QC purposes during the creation of the final reports.
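The percent gene coverage calculation described above can be sketched as follows. This is an illustration only, not the Coverage class itself. It assumes per-position depths for a region are already available as a list of integers (for example, as reported by a tool such as `samtools depth`), and that a position counts as covered when its depth is at least the minimum depth. The function name `percent_coverage` is hypothetical.

```python
def percent_coverage(depths, min_depth=10):
    """Percent of positions in a region with depth >= min_depth.

    depths: per-position read depths for one gene or region.
    Assumption: "over a minimum depth" is treated as >= min_depth.
    """
    if not depths:
        # An empty region has no covered positions.
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)
```

A global dictionary keyed by gene name, as described above, could then be built by calling this function once per region.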
"},{"location":"algorithm/technical/#creating-the-laboratorian-report","title":"Creating the Laboratorian report","text":"
Then, Parser creates the Laboratorian report using the Laboratorian class and its associated .create_laboratorian_report() method.
The Laboratorian class uses the input JSON file to collect the necessary information. The structure of the input JSON file is a good place to start the breakdown:
In this example, we can see only the relevant top-level JSON fields that are used in tbp-parser.
Of interest, the \"id\" column is used to set the global SAMPLE_NAME variable.
The lineage information, found in \"main_lin\" and \"sublin\", is used in the LIMS and Looker reports, so we won\u2019t go into detail about it here.
The variant information is what makes up the bulk of the Laboratorian report and can be found in the \"dr_variants\" and \"other_variants\" fields. We\u2019ll talk more about these fields later.
There are many other fields that are omitted from this example since they are not used in tbp-parser, such as version information and overall sample drug resistance type (like RR-TB, etc.). These fields are found in the other TBProfiler output files in more human-readable formats.
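As a rough sketch of how these top-level fields might be read, the snippet below loads a JSON document and pulls out only the fields named above ("id", "main_lin", "sublin", "dr_variants", "other_variants"). The function name `summarize_tbprofiler_json` and the sample values are hypothetical and do not come from a real TBProfiler run.

```python
import json


def summarize_tbprofiler_json(text):
    """Extract the top-level fields named in this section from JSON text."""
    data = json.loads(text)
    return {
        "sample_name": data.get("id"),
        "main_lin": data.get("main_lin"),
        "sublin": data.get("sublin"),
        # Both variant sections are lists with one item per variant.
        "n_dr_variants": len(data.get("dr_variants", [])),
        "n_other_variants": len(data.get("other_variants", [])),
    }


# Invented values, for illustration only.
example_json = json.dumps({
    "id": "sample01",
    "main_lin": "lineage4",
    "sublin": "lineage4.9",
    "dr_variants": [{"locus_tag": "Rv0667"}],
    "other_variants": [],
})
```

Calling `summarize_tbprofiler_json(example_json)` returns a small dictionary with the sample name, the two lineage fields, and the number of items in each variant section.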
Within the input JSON file, there are two fields that are examined the most: \"dr_variants\" and \"other_variants\". These fields are treated the same, and have the same format, although different mutations are found in both regions. The difference between the two fields is unclear to me at this time. In the example below, only the fields used in tbp-parser are shown.
After the global SAMPLE_NAME variable is set, the Laboratorian class calls the .iterate_section() method, starting with the \"dr_variants\" field.
Since the contents of each variant section in the JSON dictionary are considered a list, we start to iterate through each list item, which consists of each section within curly brackets {...}. In the example to the left, I\u2019ve only included 1 item in each list.
Immediately, each item in the list is converted into a Variant class object, and every item in each list item (the \"chrom\", \"genome_pos\", \"locus_tag\", etc.) is converted to a class attribute. This is because each item in the list represents a single mutation or a single variant. I\u2019ll now refer to each variant section item as a Variant.
Each new Variant object has the .extract_annotations() method called. This method starts by iterating through the \"annotation\" field in the input JSON. The annotation field can contain multiple different annotations, so we look at each one individually.
Each annotation is turned into a Row object, which represents a row in the Laboratorian report. During the initiation of the Row object, each column in the Laboratorian report is created based on both the annotation field and the originating Variant object. Additionally, a warning field is created based on both the global dictionary created with the Coverage class and the mutation\u2019s \"depth\" and \"freq\" fields.
Sometimes multiple annotations for the same drug can appear for a single Variant. If this is the case, only the most severe annotation is saved (that is, an annotation that indicates resistance is kept instead of one that indicates susceptibility).
After the annotation field has been iterated through, we then check the \"gene_associated_drugs\" field to make sure that we create a Row for each antimicrobial drug that is associated with the gene. As you can see in the \"other_variants\" section, the annotation field for the variant only lists annotations for moxifloxacin and levofloxacin, but the gene is associated with three other antimicrobial drugs. This iteration creates additional Row objects for those antimicrobial drugs.
This means that each mutation can appear several times in the final report, once for every antimicrobial drug associated with the gene. This is because a mutation may confer resistance to one drug but not to another.
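The row-building logic described above can be sketched roughly as follows. This is a simplified illustration, not the Variant or Row classes. It assumes each annotation is a dictionary with hypothetical "drug" and "call" keys, it reuses the R > R-Interim > U > S-Interim > S scale as a stand-in for annotation severity, and the "gene" key and the function name `rows_for_variant` are likewise hypothetical. Only "annotation" and "gene_associated_drugs" are field names taken from the text.

```python
# Least to most severe; a stand-in scale used only for this sketch.
SEVERITY_ORDER = ["S", "S-Interim", "U", "R-Interim", "R"]


def rows_for_variant(variant):
    """Build one row per drug for a single variant.

    Keeps only the most severe annotation per drug, then adds a row
    (with no call) for every gene-associated drug that had no annotation.
    """
    best = {}
    for annotation in variant.get("annotation", []):
        drug, call = annotation["drug"], annotation["call"]
        if drug not in best or (
            SEVERITY_ORDER.index(call) > SEVERITY_ORDER.index(best[drug])
        ):
            best[drug] = call
    rows = [
        {"gene": variant["gene"], "drug": drug, "call": call}
        for drug, call in best.items()
    ]
    for drug in variant.get("gene_associated_drugs", []):
        if drug not in best:
            rows.append({"gene": variant["gene"], "drug": drug, "call": None})
    return rows
```

For a variant annotated for two drugs in a gene associated with five, this yields five rows: two with a call and three without.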
After Row objects are created for each Variant in the variant section, every Row has the .complete_row() method called, which adds the interpretation columns to the object. Two interpretation columns are created, mdl_interpretation and looker_interpretation.
Please note that these interpretation columns are typically identical, but in several cases, the mdl_interpretation column will call a variant-drug combination as \u201csusceptible\u201d (S), while the looker_interpretation column will call the same combination \u201cuncertain\u201d (U).
In the case where a WHO annotation was not identified, the Variant class\u2019 .apply_expert_rules() method is called. This function applies expert rules that are listed in detail on the tbp-parser GitHub repository, available here.
The expert rules assign a drug resistance call to the variant-drug combination only when there is no WHO annotation and will fill the mdl_interpretation and looker_interpretation fields.
If the mutation is in either mmpS5, mmpL5, or mmpR5/Rv0678, then the \"alternate_consequences\" field is iterated through. This field typically lists the same mutation but in reference to a different gene; for instance, if a mutation is in the upstream non-coding region of one gene, it may be in the coding region of a different gene.
Then, any genes that do not have any variants are added to the laboratorian report with various \u201cNA\u201d or \u201cWT\u201d values filling the appropriate fields.
This means that every gene in the TBDB appears in the Laboratorian report regardless of whether any mutations were identified in that gene.
Finally, a few more quality control measures are taken and then all of the individual Row objects are written to a CSV file, which concludes the creation of the laboratorian report.
"},{"location":"algorithm/technical/#creating-the-looker-report","title":"Creating the Looker report","text":"
The Parser class then creates a Looker object which uses the .create_looker_report() method. The Looker report uses the Laboratorian report to generate most of the included information.
It starts by iterating through a list of antimicrobial drugs and extracting all of the looker_interpretation values for each row in the report with that antimicrobial drug. It then identifies the highest resistance rating (R > R-Interim > U > S-Interim > S) for all resistance annotations for a drug.
Then, a quality check is performed: if a gene that contributed to the highest resistance rating fails the coverage threshold, an insufficient coverage warning is given.
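The per-drug summary described above might look like the sketch below. It is an illustration, not the Looker class. It assumes the Laboratorian rows are available as dictionaries keyed by the column names antimicrobial, looker_interpretation, and tbprofiler_gene_name, and that the set of genes failing coverage is computed elsewhere. The function name `summarize_drug` is hypothetical.

```python
# Ranking taken from the text: R > R-Interim > U > S-Interim > S.
RANK = {"S": 0, "S-Interim": 1, "U": 2, "R-Interim": 3, "R": 4}


def summarize_drug(rows, drug, genes_failing_coverage):
    """Return (highest interpretation, insufficient-coverage flag) for a drug."""
    drug_rows = [row for row in rows if row["antimicrobial"] == drug]
    if not drug_rows:
        return None, False
    highest = max(
        drug_rows, key=lambda row: RANK[row["looker_interpretation"]]
    )["looker_interpretation"]
    # Genes whose rows carry the highest rating for this drug.
    contributing = {
        row["tbprofiler_gene_name"]
        for row in drug_rows
        if row["looker_interpretation"] == highest
    }
    insufficient = bool(contributing & set(genes_failing_coverage))
    return highest, insufficient
```

The flag is raised only when a gene behind the highest rating fails coverage, so a low-coverage gene with a lower rating does not trigger it in this sketch.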
The \"main_lin\" and \"sublin\" fields from the input JSON file are used to fill the ID field in the report. These fields are converted into shortened English without any technical lineage information.
Finally, the information is written to a CSV file which concludes the creation of the Looker report.
"},{"location":"algorithm/technical/#creating-the-lims-report","title":"Creating the LIMS report","text":"
The Parser class then creates a LIMS object which uses the .create_lims_report() method. The LIMS report also uses the Laboratorian report to generate the bulk of the information included.
The .create_lims_report() method begins by iterating through each LIMS antimicrobial and gene code (corresponding to the LIMS codes in the CDPH STAR LIMS system). Then, as in the Looker report, the highest mdl_interpretation value is extracted from the rows in the report that are associated with that antimicrobial drug. The annotation is then converted into a human-readable format (R \u2192 \u201cMutation(s) associated with resistance to {antimicrobial} detected\u201d, etc.).
Then, the .apply_lims_rules() function is activated which determines which mutations should be output for the corresponding drug-gene combination. The mutations are then formatted so that they appear in the following format: {nucleotide mutation} ({amino acid mutation, if available}) repeated, separated by semicolons.
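The mutation formatting described above can be sketched as follows. This is not the .apply_lims_rules() implementation; it only illustrates the output shape. It assumes mutations arrive as (nucleotide, amino acid) pairs with `None` when no amino acid change is available, and that entries are joined with a semicolon followed by a space. The function name `format_lims_mutations` is hypothetical.

```python
def format_lims_mutations(mutations):
    """Format (nucleotide, amino_acid) pairs as 'nt (aa); nt (aa); ...'.

    The amino acid part is omitted when it is missing or empty.
    """
    parts = []
    for nucleotide, amino_acid in mutations:
        if amino_acid:
            parts.append(f"{nucleotide} ({amino_acid})")
        else:
            parts.append(nucleotide)
    return "; ".join(parts)
```

For example, a coding mutation paired with an upstream mutation would be rendered as two entries, with the parenthesised amino acid change present only on the first.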
Some specific parsing rules apply to mutations within the rpoB gene, which changes the output language on the LIMS report. These rules depend on the position of the mutation in the gene.
After the rules are applied and the mutations are collected, the information is written to a CSV file which concludes the creation of the LIMS report.
"},{"location":"algorithm/technical/#creating-the-coverage-report","title":"Creating the coverage report","text":"
The Parser class then reuses the Coverage object created first and calls the .reformat_coverage() method which adds any warnings, such as any deletion mutations detected for a gene. If a deletion is detected, a warning is useful because it indicates that although the reported coverage is less than 100%, it may be due to that deletion. If the coverage is still 100% and a deletion was identified, the warning will say that the deletion may be upstream.
The coverage dictionary and the associated warnings are then written to a CSV file which concludes the creation of the coverage report, and the tbp-parser script.
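The deletion warning logic described above can be sketched as a small function. It is an illustration rather than the .reformat_coverage() method, and the warning strings are hypothetical placeholders, not the text that tbp-parser writes.

```python
def coverage_warning(percent_coverage, has_deletion):
    """Return a warning string for a gene, or an empty string if none applies.

    Warning wording is hypothetical and used only for this sketch.
    """
    if not has_deletion:
        return ""
    if percent_coverage < 100:
        # The deletion may explain why coverage is below 100%.
        return "Deletion identified; may explain reduced coverage"
    # Coverage is still 100%, so the deletion may lie upstream.
    return "Deletion identified; may be upstream"
```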
The inputs on this page reflect the parameters that are applicable for the command-line tool. To see the inputs required for tbp-parser when run as part of the TheiaProk workflow series, please refer to the TheiaProk Inputs page.
tbp-parser is designed to run immediately after Jody Phelan\u2019s TB-Profiler tool. Only two inputs are required: the JSON file produced by TB-Profiler and the BAM file produced by TB-Profiler.
The JSON file contains information about the mutations detected in the sample: the quality, the type, and if that mutation confers resistance to an antimicrobial drug. The BAM file contains the alignment information for the sample and is needed for determining sequencing quality.
| Parameter | Description |
| :-------- | :---------- |
| input_json | The path to the JSON file that was produced by TB-Profiler |
| input_bam | The path to the BAM file that was produced by TB-Profiler |
Info
The BAM file must have the accompanying BAI file in the same directory. It must also be named exactly the same as the BAM file but ending with a .bai suffix.
tbp-parser can be customized with a number of optional input parameters. These parameters can be used to control the quality control thresholds, the text that appears in the reports, and the names of the output files. The following is a list of all the input parameters that can be used with tbp-parser.
In addition to these arguments, tbp-parser also has a -h, --help argument that will print out the list of possible arguments and their descriptions and a -v, --version argument that will print out the version of tbp-parser that is installed. Both of these commands exit the program after printing their output.
"},{"location":"inputs/inputs/#quality-control-arguments","title":"Quality Control Arguments","text":"
These options determine the thresholds for quality control.
| Short Version | Long Version | Description | Default Value |
| :------------ | :----------- | :---------- | :------------ |
| -d | --min_depth | The minimum depth of coverage required for a site to pass QC | 10 |
| -c | --min_percent_coverage | The minimum percentage of a region that has depth above the threshold set by min_depth (used for a gene/locus to pass QC) | 100 |
| -s | --min_read_support | The minimum read support for a mutation to pass QC | 10 |
| -f | --min_frequency | The minimum frequency for a mutation to pass QC (0.1 -> 10%) | 0.1 |
| -r | --coverage_regions | A BED file containing the regions to calculate percent coverage for | /data/tbdb-modified-regions.bed |
"},{"location":"inputs/inputs/#text-arguments","title":"Text Arguments","text":"
These options are used verbatim in the reports, or are used to name the output files.
| Short Version | Long Version | Description | Default Value |
| :------------ | :----------- | :---------- | :------------ |
| -m | --sequencing_method | The sequencing method used to generate the data; used in the LIMS & Looker reports. Enclose in quotes if including a space | \"Sequencing method not provided\" |
| -p | --operator | The operator who ran the analysis; used in the LIMS & Looker reports. Enclose in quotes if including a space | \"Operator not provided\" |
| -o | --output_prefix | The prefix to use for the output files. Do not include any spaces | \"tbp-parser\" |
"},{"location":"inputs/inputs/#lims-arguments","title":"LIMS Arguments","text":"
These options are used to customize the LIMS report
| Name | Description | Default Value |
| :--- | :---------- | :------------ |
| --add_cs_lims | Adds Cycloserine (CS) fields to the LIMS report | false |
"},{"location":"inputs/inputs/#tngs-specific-arguments","title":"tNGS-specific Arguments","text":"
These options are primarily used for tNGS data, although all frequency arguments are compatible with WGS data.
| Name | Description | Default Value |
| :--- | :---------- | :------------ |
| --tngs | Indicates that the input data was generated using the Deeplex + CDPH modified protocol. Turns on tNGS-specific global parameters | false |
| --tngs_expert_regions | A BED file containing the regions to calculate coverage for expert rule regions. This is used to determine coverage quality in the regions where resistance-conferring mutations are found, or where a CDC expert rule is applied. This is not used for QC purposes | /data/tbdb-expert-regions.bed |
| --rrs_frequency | The minimum frequency for an rrs mutation to pass QC, as rrs has several problematic sites in the Deeplex tNGS assay | 0.1 |
| --rrl_frequency | The minimum frequency for an rrl mutation to pass QC, as rrl has several problematic sites in the Deeplex tNGS assay | 0.1 |
| --rrs_read_support | The minimum read support for an rrs mutation to pass QC, as rrs has several problematic sites in the Deeplex tNGS assay | 10 |
| --rrl_read_support | The minimum read support for an rrl mutation to pass QC, as rrl has several problematic sites in the Deeplex tNGS assay | 10 |
| --rpob449_frequency | The minimum frequency for an rpoB mutation at protein position 449 to pass QC, as this site is problematic in the Deeplex tNGS assay | 0.1 |
| --etha237_frequency | The minimum frequency for an ethA mutation at protein position 237 to pass QC, as this site is problematic in the Deeplex tNGS assay | 0.1 |
"},{"location":"inputs/inputs/#logging-arguments","title":"Logging Arguments","text":"
These options change the verbosity of the stdout log
| Name | Description | Default Value |
| :--- | :---------- | :------------ |
| --verbose | Increases the output verbosity to describe which stage of the analysis is currently running | false |
| --debug | The highest level of output verbosity detailing every step of the analysis and logic implemented; overwrites --verbose | false |
"},{"location":"inputs/theiaprok/","title":"TheiaProk Inputs on Terra","text":"
When running tbp-parser as part of the TheiaProk workflow series (find documentation for TheiaProk here) on Terra.bio, an optional input must be activated to instruct TheiaProk to run tbp-parser.
tbp-parser is not on by default due to the nature of this tool and its outputs.
TheiaProk Version
This information only corresponds to PHB v2.2.0. These inputs and outputs may not be applicable to other versions of TheiaProk.
To activate tbp-parser you must set the following variable to true:
| Terra Task name | Variable | Type | Default value | Description |
| :-------------- | :------- | :--- | :------------ | :---------- |
| merlin_magic | tbprofiler_additional_outputs | Boolean | false | Set to true to activate tbp-parser |
"},{"location":"inputs/theiaprok/#optional-inputs","title":"Optional Inputs","text":"
The following optional inputs are also available for user modification on Terra:
| Terra Task name | Variable | Type | Default value | Description |
| :-------------- | :------- | :--- | :------------ | :---------- |
| merlin_magic | tbp_parser_output_seq_method_type | String | \"WGS\" | Fills out the \u201cseq_method\u201d field in the tbp_parser output files |
| merlin_magic | tbp_parser_operator | String | \"Operator not provided\" | The operator who ran the analysis; used in the LIMS & Looker reports |
| merlin_magic | tbp_parser_min_depth | Int | 10 | The minimum depth of coverage required for a site to pass QC |
| merlin_magic | tbp_parser_min_frequency | Int | 0.1 | The minimum frequency for a mutation to pass QC (0.1 -> 10%) |
| merlin_magic | tbp_parser_min_read_support | Int | 10 | The minimum read support for a mutation to pass QC |
| merlin_magic | tbp_parser_coverage_threshold | Int | 100 | The minimum percentage of a region that has depth above the threshold set by min_depth (used for a gene/locus to pass QC) |
| merlin_magic | tbp_parser_coverage_regions_bed | File | tbdb-modified-regions.bed | A BED file containing the regions to calculate percent coverage for |
| merlin_magic | tbp_parser_debug | Boolean | false | Turn on debug mode for tbp-parser |
| merlin_magic | tbp_parser_add_cs_lims | Boolean | false | Adds Cycloserine (CS) fields to the LIMS report |
| merlin_magic | tbp_parser_docker_image | String | \"us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:1.6.0\" | The Docker image to use when running tbp-parser |
Find the outputs for tbp-parser in TheiaProk on Terra here.
tbp-parser produces four files as outputs. See each individual page for more details on how they are constructed and what they contain:
Laboratorian report
LIMS report
Looker report
Coverage report
The four reports contain a wealth of information. Ordered from most to least verbose, they are: the laboratorian report, the LIMS report, the Looker report, and the coverage report. The same information is used in all four reports but at differing levels of verbosity.
Running tbp-parser as part of TheiaProk on Terra produces additional outputs. You can find that information in the TheiaProk Outputs on Terra page.
The coverage report lists every gene and its percent gene coverage over a minimum depth (default: 10) relative to the H37Rv genome.
Please note that user-provided coverage regions always take precedence over default values.
"},{"location":"outputs/coverage/#wgs-coverage-report","title":"WGS Coverage Report","text":"
| Column name | Explanation |
| :---------- | :---------- |
| Gene | The name of the gene or locus |
| Percent_Coverage | The percent of the gene\u2019s coding region that has a read depth over the minimum value (default: 10; user-customizable by altering --min_depth) |
| Warning | Indicates if any deletions were identified in the gene which may contribute to lower than expected coverage |
If run using the TheiaProk workflow series, there will be an additional column that contains only the name of the sample, which is useful when concatenating many reports as it helps differentiate which gene belongs to which sample.
If the --tngs flag is used, the report contains the following fields:
| Column name | Explanation |
| :---------- | :---------- |
| Gene | The name of the gene or locus |
| Coverage_Breadth_reportableQC_region | The percent of the gene (positions determined by the regions covered by the tNGS Deeplex + CDPH assay primers that are considered reportable by CDPH) that is covered at a depth greater than the --min_depth value |
| QC_Warning | Indicates if any deletions were identified in the gene which may contribute to lower than expected coverage |
| Coverage_Breadth_R_expert-rule_region | The percent of the regions (positions that could contain any resistance-conferring mutations or require expert-rule application) that is covered at a depth greater than the --min_depth value |

Coverage regions are determined with either the default \"../data/tbdb-modified-regions.bed\" (collected on Sep 1, 2023 from the TBProfiler repository) or, if --tngs is used, \"../data/tngs-reportable-regions.bed\".
The R-expert rule region is determined only if --tngs is indicated and uses the ranges in \"../data/tngs-expert-rule-regions.bed\".
The laboratorian report is the main report produced by tbp-parser and is used to generate all of the other reports. What follows is an explanation of all the columns in the report.
"},{"location":"outputs/laboratorian/#explanation-of-column-headers","title":"Explanation of column headers","text":"
| Column name | Explanation |
| :---------- | :---------- |
| sample_id | The name of the sample |
| tbprofiler_gene_name | The name of the gene where the mutation has been identified |
| tbprofiler_locus_tag | The locus tag for the mutation that has been identified |
| tbprofiler_variant_substitution_type | The type of mutation identified, whether or not it was a frameshift, missense, or synonymous mutation |
| tbprofiler_variant_substitution_nt | The mutation in nucleotide format |
| tbprofiler_variant_substitution_aa | The mutation in amino acid format, if possible |
| confidence | Contains either: the WHO annotation; an indication that there was no WHO annotation; or NA for when there is no mutation |
| antimicrobial | The antimicrobial drug that may be affected by this mutation |
| looker_interpretation | The drug resistance interpretation intended for the Looker report |
| mdl_interpretation | The drug resistance interpretation intended for the LIMS report |
| depth | The depth of coverage at the mutation |
| frequency | The frequency of the mutation in the reads |
| read_support | How many reads support the mutation (depth * frequency) |
| rationale | Contains an indication of what was used (the WHO annotation, the specific expert rule used, or neither) to create the two interpretations |
| warning | Any potential quality warnings that may indicate lower reliability |
| gene_tier | The gene tier of the mutation\u2019s gene (Tier 1, Tier 2, or NA) |
Because of how a particular mutation may contribute resistance to different drugs at the same time, each mutation is listed multiple times, once for each antimicrobial drug that could be affected. In addition, any genes that do not have any mutations are also included in the laboratorian report with NA or WT in the appropriate field. This results in a report with many rows and often, rows with very similar values. However, the laboratorian report contains the \u201ccomplete picture\u201d of the sample and is incredibly useful for understanding the sample\u2019s drug resistance profile.
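As a rough illustration of the read_support column and the one-row-per-drug layout described above, the sketch below expands a single mutation into one row per antimicrobial. The function name, the input dictionary keys, and the rounding step are assumptions made for this example; they are not taken from the tbp-parser source.

```python
def expand_mutation(mutation, drugs):
    # Hypothetical helper: emit one report row per drug the mutation may affect.
    rows = []
    for drug in drugs:
        rows.append({
            'tbprofiler_gene_name': mutation['gene'],
            'antimicrobial': drug,
            'depth': mutation['depth'],
            'frequency': mutation['frequency'],
            # read_support is described as depth * frequency; rounding is an assumption.
            'read_support': round(mutation['depth'] * mutation['frequency']),
        })
    return rows


# fabG1 appears under both isoniazid and ethionamide in the LIMS report columns.
rows = expand_mutation(
    {'gene': 'fabG1', 'depth': 200, 'frequency': 0.25},
    ['isoniazid', 'ethionamide'],
)
```

The same mutation therefore yields two rows that differ only in the antimicrobial field, which is why the laboratorian report often contains rows with very similar values.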
The LIMS report is intended for direct import into a STAR LIMS system. The columns are in the specific LIMS code format for CDPH, and may not apply to your LIMS system. Please contact us if you need different column headers and we can work with you towards a solution.
"},{"location":"outputs/lims/#explanation-of-column-headers","title":"Explanation of column headers","text":"Column name Explanation MDL sample accession numbers The name of the sample M_DST_A01_ID The lineage of the sample in human-readable language M_DST_B01_INH The highest mdl_interpretation resistance identified for mutations associated with this drug (isoniazid) M_DST_B02_katG Any non-S mutations found in this gene with good quality responsible for the predicted resistance for isoniazid M_DST_B03_fabG1 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for isoniazid M_DST_B04_inhA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for isoniazid M_DST_C01_ETO The highest mdl_interpretation resistance identified for mutations associated with this drug (ethionamide) M_DST_C02_ethA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethionamide M_DST_C03_fabG1 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethionamide M_DST_C04_inhA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethionamide M_DST_D01_RIF The highest mdl_interpretation resistance identified for mutations associated with this drug (rifampin) M_DST_D02_rpoB Any non-S mutations found in this gene with good quality responsible for the predicted resistance for rifampin M_DST_E01_PZA The highest mdl_interpretation resistance identified for mutations associated with this drug (pyrazinamide) M_DST_E02_pncA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for pyrazinamide M_DST_F01_EMB The highest mdl_interpretation resistance identified for mutations associated with this drug (ethambutol) M_DST_F02_embA Any non-S mutations found in this gene with good 
quality responsible for the predicted resistance for ethambutol M_DST_F03_embB Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethambutol M_DST_G01_AMK The highest mdl_interpretation resistance identified for mutations associated with this drug (amikacin) M_DST_G02_rrs Any non-S mutations found in this gene with good quality responsible for the predicted resistance for amikacin M_DST_G03_eis Any non-S mutations found in this gene with good quality responsible for the predicted resistance for amikacin M_DST_H01_KAN The highest mdl_interpretation resistance identified for mutations associated with this drug (kanamycin) M_DST_H02_rrs Any non-S mutations found in this gene with good quality responsible for the predicted resistance for kanamycin M_DST_H03_eis Any non-S mutations found in this gene with good quality responsible for the predicted resistance for kanamycin M_DST_I01_CAP The highest mdl_interpretation resistance identified for mutations associated with this drug (capreomycin) M_DST_I02_rrs Any non-S mutations found in this gene with good quality responsible for the predicted resistance for capreomycin M_DST_I03_tlyA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for capreomycin M_DST_J01_MFX The highest mdl_interpretation resistance identified for mutations associated with this drug (moxifloxacin) M_DST_J02_gyrA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for moxifloxacin M_DST_J03_gyrB Any non-S mutations found in this gene with good quality responsible for the predicted resistance for moxifloxacin M_DST_K01_LFX The highest mdl_interpretation resistance identified for mutations associated with this drug (levofloxacin) M_DST_K02_gyrA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for levofloxacin M_DST_K03_gyrB Any non-S mutations found in this gene 
with good quality responsible for the predicted resistance for levofloxacin M_DST_L01_BDQ The highest mdl_interpretation resistance identified for mutations associated with this drug (bedaquiline) M_DST_L02_Rv0678 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_L03_atpE Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_L04_pepQ Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_L05_mmpL5 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_L06_mmpS5 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_M01_CFZ The highest mdl_interpretation resistance identified for mutations associated with this drug (clofazimine) M_DST_M02_Rv0678 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for clofazimine M_DST_M03_pepQ Any non-S mutations found in this gene with good quality responsible for the predicted resistance for clofazimine M_DST_M04_mmpL5 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for clofazimine M_DST_M05_mmpS5 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for clofazimine M_DST_N01_LZD The highest mdl_interpretation resistance identified for mutations associated with this drug (linezolid) M_DST_N02_rrl Any non-S mutations found in this gene with good quality responsible for the predicted resistance for linezolid M_DST_N03_rplC Any non-S mutations found in this gene with good quality responsible for the predicted resistance for linezolid Analysis date The date tbp-parser was run in YYYY-MM-DD HH:SS format Operator The name of the person who ran tbp-parser; can be provided 
with the --operator input parameter. If left blank, \u201cOperator not provided\u201d is the default value. M_DST_O01_lineage The lineage of the sample (the main_lin of the sample as reported by TBProfiler) M_DST_P01_CS The highest mdl_interpretation resistance identified for mutations associated with this drug (cycloserine); only included when --add_cs_lims is set to true M_DST_P02_ald Any non-S mutations found in this gene with good quality responsible for the predicted resistance for cycloserine; only included when --add_cs_lims is set to true M_DST_PO3_alr Any non-S mutations found in this gene with good quality responsible for the predicted resistance for cycloserine; only included when --add_cs_lims is set to true
The LIMS report offers a condensed version of the laboratorian report with more details than the Looker report. By containing only the most important information about a drug and its related mutations, the LIMS report provides an invaluable summary.
The Looker report is intended for use in Google's Looker Studio (formerly Data Studio) for dashboarding purposes. It offers a highly condensed version of the resistance calls (using the looker_interpretation field from the laboratorian report) for a quick summary of the sample\u2019s drug resistance profile.
"},{"location":"outputs/looker/#explanation-of-column-headers","title":"Explanation of column headers","text":"Column name Explanation sample_id The name of the sample output_seq_method_type The sequencing method used to generate the data; can be set with the --sequencing_method input parameter. If left blank, \u201cSequencing method not provided\u201d is the default value amikacin The highest looker_interpretation resistance identified for mutations associated with this drug bedaquiline The highest looker_interpretation resistance identified for mutations associated with this drug capreomycin The highest looker_interpretation resistance identified for mutations associated with this drug clofazimine The highest looker_interpretation resistance identified for mutations associated with this drug ethambutol The highest looker_interpretation resistance identified for mutations associated with this drug ethionamide The highest looker_interpretation resistance identified for mutations associated with this drug isoniazid The highest looker_interpretation resistance identified for mutations associated with this drug kanamycin The highest looker_interpretation resistance identified for mutations associated with this drug levofloxacin The highest looker_interpretation resistance identified for mutations associated with this drug linezolid The highest looker_interpretation resistance identified for mutations associated with this drug moxifloxacin The highest looker_interpretation resistance identified for mutations associated with this drug pyrazinamide The highest looker_interpretation resistance identified for mutations associated with this drug rifampin The highest looker_interpretation resistance identified for mutations associated with this drug streptomycin The highest looker_interpretation resistance identified for mutations associated with this drug lineage The lineage of the sample (the main_lin field as reported by TB-Profiler); for example, lineage1.2.1.2.1 ID The 
lineage of the sample in human-readable language (the same as M_DST_A01_ID in the LIMS report) analysis_date The date tbp-parser was run in YYYY-MM-DD HH:SS format operator The name of the person who ran tbp-parser; can be provided with the --operator input parameter. If left blank, \u201cOperator not provided\u201d is the default value.
Please note that the looker_interpretation field can differ from the mdl_interpretation field. The two are usually identical, but occasionally the mdl_interpretation column will call a variant-drug combination \u201csusceptible\u201d (S) while the looker_interpretation column calls the same combination \u201cuncertain\u201d (U). Be aware of this difference when choosing an interpretation to report.
"},{"location":"outputs/theiaprok/","title":"TheiaProk Outputs on Terra","text":"
When running tbp-parser as part of the TheiaProk workflow series (find documentation for TheiaProk here) on Terra.bio, you will find the following outputs in your data table.
TheiaProk Version
This information only corresponds to PHB v2.2.0. These inputs and outputs may not be applicable to other versions of TheiaProk.
Variable Type Description tbp_parser_average_genome_depth Float The average depth of coverage across the H37Rv reference genome tbp_parser_coverage_report File The coverage report generated by tbp-parser tbp_parser_docker String The Docker image used to run tbp-parser tbp_parser_genome_percent_coverage Float The percentage of the H37Rv reference genome that has depth above the threshold set by tbp_parser_min_depth tbp_parser_laboratorian_report_csv File The laboratorian report generated by tbp-parser tbp_parser_lims_report_csv File The LIMS report generated by tbp-parser tbp_parser_looker_report_csv File The Looker report generated by tbp-parser tbp_parser_version String The version of tbp-parser used in the analysis as determined by tbp-parser --version
Find the inputs for tbp-parser in TheiaProk on Terra here.
"},{"location":"versioning/","title":"Versioning and Releases","text":"
The California Department of Public Health has clinically validated the following versions:
v1.2.2 for WGS, and
v1.4.4.8 for tNGS
Interpretation documents for v1.2.2 and v1.4.4.8 are available in the root directory of the tbp-parser repository; others are available in the interpretation_docs directory on GitHub.
If you are running tbp-parser as part of the TheiaProk pipeline(s) with Terra, the following branches are recommended:
To run v1.2.2 on Terra, please use the smw-tb-2024-01-16-dev branch.
To run v1.4.4.8+ and v1.6.x+, please use the smw-tb-2024-05-03-dev branch.
To run v1.5.x+ and v2.x+, please use the smw-tb-2024-05-03-who2-dev branch.
For more information on the differences between versions, you can see the Brief Description of Versions or the Exhaustive List of Versions.
"},{"location":"versioning/brief/","title":"Brief Description of Versions","text":"
You may notice there are many releases; tbp-parser is in active development and each release is \"use at your own risk.\" We highly recommend upgrading to the latest release, as it includes important bug fixes. To help track the changes, we have included a brief description of each release:
v1.2.x & below - the initial developmental stages of tbp-parser for WGS data
v1.3.x - adds tNGS data parsing and includes some updates applicable to WGS parsing
v1.4.x - reworks how QC is performed (changes in order of operations)
v1.4.3+ - changes how tNGS lineage determination is performed
v1.4.4+ - changes how nonsynonymous mutations are interpreted; major interpretation differences from earlier versions
v1.6.x - only considers the genes included in the LIMS report to determine the drug output in the LIMS report
v1.5.x+ and v2.0.0 - major changes to the code due to using results from TB-Profiler v6.2.0+
code changes for v2.x are available on the who-v2 branch of tbp-parser
For a more exhaustive list, please visit the Exhaustive List of Versions.
"},{"location":"versioning/exhaustive/","title":"Exhaustive version descriptions","text":"
The following is a list of every version of tbp-parser and a short summary of the changes made in each version.
Blue indicates that CDPH performed a clinical validation on that version
v1.0.0 - initial version
v1.1.0 - adjusts the highest interpretation for a drug to only consider genes in LIMS report, adds the rule to the confidence column, adds QRDR expert rules for gyrA and gyrB
v1.1.1 - fixes a bug in R/QRDR region calculations
v1.1.2 - adjusts LIMS lineage designation by checking for BCG and if lineage from TB Profiler is empty
v1.1.3 - now includes the TB Profiler sublineage output when determining BCG M bovis
v1.1.4 - now checks if multiple lineages/sublineages were detected
v1.1.5 - checks all mmpS/mmpL/mmpR alternate consequences; also checks to make sure all drugs are reported
v1.1.5.1 - renames rifampicin to rifampin
v1.1.6 - removes a locus warning with deletion caveat
v1.1.7 - ensures all deletion caveat locus warnings are gone, overwrites all fields with locus warning with \u201cNA\u201d or \u201cInsufficient Coverage\u201d as appropriate and moves them to the bottom of the Laboratorian report
v1.1.8 - changes overwrite to only overwrite interpretation values, not mutation information
v1.1.9 - renames rifampicin to rifampin
v1.2.0 - enables ability to provide alternate coverage bed file; introduced the modified regions (just coding region + 30bp upstream or promoter region)
v1.2.1 - fixes a bug when renaming rifampicin to rifampin
v1.2.2 (WGS) - improve how maximum MDL interpretation is calculated for the LIMS report. Use the smw-tb-2024-01-16-dev branch on Terra.
v1.2.3 - check only the LIMS genes\u2019 coverage for LIMS lineage determination and use a threshold for all lineage designation
v1.3.0 - adds tNGS regions, checks to make sure that only variants for genes in the coverage report are included in the laboratorian (tNGS), error-proof locus tag designation, add check to prevent failures when gene not in coverage dictionary (tNGS), adds \u201cNA\u201d to the mutation rank list (score = 0, same as Insufficient Coverage)
v1.3.1 - adds --tngs flag to turn on tNGS-specific global parameters, establishes different threshold calculation for lineage designation for tNGS, checks the segment of a gene a variant was detected in, removes check that did not prevent failures when gene not in coverage dictionary from v1.3.0, error-proof all coverage checks, adds \u201cThis mutation is outside the expected region\u201d warning
v1.3.2 - error-proofs coverage warning and adds additional section for tNGS gene segments, error-proofs gene tier for tNGS gene segments
v1.3.3 - condenses most gene segments into one, for WT mutations, set the mutation to \u201cWT\u201d not \u201cNA\u201d
v1.3.4 - error-proofs maximum mdl interpretation determination and maximum looker interpretation determination
v1.3.5 - adds rrs & rrl frequency input parameters to customize mutation frequency for those genes, overwrites gene MDL interpretation when \u201cInsufficient Coverage\u201d to act as if \u201cWT\u201d if greater than S
v1.3.6 - adds the TB-Profiler lineage to the end of the LIMS report and the Looker report, adds LIMS lineage to Looker report, introduces check if max MDL interpretation is also Insufficient Coverage to change output to Pending Retest
v1.3.7 - add to the coverage report the \u201cexpert rule regions\u201d column for tNGS, overwrites gene MDL interpretation when \u201cInsufficient Coverage\u201d to act as if \u201cWT\u201d if greater than or equal to S
v1.3.8 - add frequency input parameters for rpoB 449 and ethA 237, renames coverage threshold to minimum percent coverage
v1.3.9 - check if gene name is rpoB because that means it\u2019s outside the expected region (tNGS - rpoB is in two segments), add rrs and rrl read support input parameters
v1.4.0 - rework how QC is performed (order of operations)
v1.4.1 - remove rpoB expected region check, implements deletion position quality check in QC (keep only valid deletions), if outside expected region warning, set MDL interpretations to NA
v1.4.2.1 (same change in v1.5.4) - prevent overwriting \u201cR\u201d mutations with No Sequence, and overwrite \u201cU\u201d mutations with \u201cPending Retest\u201d if bad quality
v1.4.3 - implement different thresholds for LIMS lineage identification for tNGS
v1.4.4 - update expert rule interpretations (mainly S \u2192 U in several spots)
v1.4.4.1 (v1.5.0 branched off of this one)- update LIMS threshold to 90, not the coverage threshold
v1.4.4.2 (same change in v1.5.1) - fix an issue where \u201cNo sequence\u201d was not triggering Pending Retest
v1.4.4.3 (same change in v1.5.5) - fix an issue where \u201cPending Retest\u201d was not properly appearing
v1.4.4.4 (same change in v1.5.6) - prevent \u201cPending Retest\u201d if Insufficient Coverage is in a gene that also has a valid deletion
v1.4.4.5 - consider deletions invalid if coverage is between 0 and minimum coverage (10 default) (this consideration is unique to old TB Profiler and not mimicked in v1.5)
v1.4.4.6 - a mistake; updates the version (this release is a mystery to me as there is nothing in there except version update)
v1.4.4.7 (same change in v1.5.8) - change tNGS LIMS lineage designation to items in the coverage dictionary (to represent both rpoB segments)
v1.4.4.8 (tNGS) (same change in v1.5.9)- reduce tNGS LIMS threshold to 70% from 90. Use the smw-tb-2024-05-03-dev branch on Terra for this and all subsequent v1.4.4.x+ versions.
v1.4.4.9 (same change in v1.5.7) - add optional input to add cycloserine to LIMS report
v1.4.4.10 - fix issue when MDL resistance was being overwritten to Pending Retest but without considering other genes when calculating the highest MDL resistance (as the other genes may have had higher resistances that were not captured at first)
v1.4.4.11 - fix issue introduced by last fix where we ran into indexing errors due to no more MDL interpretations available in the list
v1.5.0 (branched off of v1.4.4.1)- make all language changes necessary to be compatible with TBProfiler v6.2.1. Use the smw-tb-2024-05-03-who2-dev branch on Terra for this and all subsequent v1.5.x+ versions.
v1.5.1 (same change in v1.4.4.2)- fix an issue where \u201cNo sequence\u201d was not triggering Pending Retest
v1.5.2 - a mistake; somehow exactly the same as 1.4.4.2?? (this release is also a mystery)
v1.5.3 - make additional language changes and fix an unusual edge case where the same mutation was identified; rename mmpR5 to Rv0678 again
v1.5.4 (same change in v1.4.2.1) - prevent overwriting \u201cR\u201d mutations with No Sequence
v1.5.5 (same change in v1.4.4.3) - fix an issue where \u201cPending Retest\u201d was not properly appearing; consider only LIMS genes for LIMS report
v1.5.6 (same change in v1.4.4.4) - prevent \u201cPending Retest\u201d if Insufficient Coverage is in a gene that also has a valid deletion
v1.5.7 (same change in v1.4.4.9) - add optional input to add cycloserine to LIMS report
v1.5.8 (same change in v1.4.4.7) - change tNGS LIMS lineage designation to check items in the coverage dictionary (to represent both rpoB segments; percentage calculation erroneously combined them)
v1.5.9 (same change in v1.4.4.8) - reduce tNGS LIMS threshold to 70% from 90
v1.5.10 - correct spelling of two genes in the LIMS report for cycloserine
v1.6.0 (branched off of v1.4.4.11) - ensures that only LIMS genes are being considered for the LIMS report. Use the smw-tb-2024-05-03-dev branch on Terra for this and all subsequent v1.6.x+ versions.
v2.0.0 (branched off of v1.5.10; same change in v1.4.4.10 and v1.4.4.11) - fix issue when MDL resistance was being overwritten to Pending Retest but without considering other genes when calculating the highest MDL resistance (as the other genes may have had higher resistances that were not captured at first) and fixes the resulting issue where indexing errors occurred\u00a0due to no more MDL interpretations. Use the smw-tb-2024-05-03-who2-dev branch on Terra for this and all subsequent v2.x+ versions.
The following diagram shows how each version is related to the others without technical details:
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"tbp-parser","text":"
Not for Diagnostic Use
CAUTION: The information produced by this program should not be used for clinical reporting unless and until extensive validation has occurred in your laboratory on a stable version. Otherwise, the outputs of tbp-parser are for research use only.
FUTURE DEPRECATION NOTICE
At the time of the PHB v2.3.0 release:
all branches on Terra that have been mentioned in this documentation will be deleted. Please use the v2.3.0 version of TheiaProk moving forward.
the main branch of tbp-parser will host v2.1.0 and above; earlier versions of tbp-parser will no longer be supported
future releases of tbp-parser will only support outputs generated by TBProfiler v6.0.0 and above.
Versions of TBProfiler prior to v6.0.0 are not compatible with v2+ of tbp-parser. Please ensure that you are using the correct version of tbp-parser for your version of TBProfiler.
tbp-parser is a tool developed in partnership with the California Department of Public Health (CDPH) to parse the output of Jody Phelan\u2019s TBProfiler tool into four additional files:
A Laboratorian report, which contains information about each mutation detected and its associated drug resistance profile in a CSV file.
A LIMS report, formatted specifically for CDPH\u2019s STAR LIMS, which summarizes the highest severity mutations for each antimicrobial drug and the relevant mutations.
A Looker report, which condenses the information contained in the Laboratorian report into a format suitable for generating a dashboard in Google\u2019s Looker Studio.
A coverage report, which contains the percent coverage of each gene relative to the H37Rv reference genome in addition to any warnings, such as any deletions identified in the gene that might have contributed to a reduced percent coverage
Please reach out to us at support@theiagen.com if you would like any custom file formats and/or changes to these output files that suit your individual needs.
We host our Docker images on the Google Artifact Registry so that they are always available for use.
The entrypoint for this Docker image is the tbp-parser help message. To run this container interactively, you can use the following command:
docker run -it --entrypoint=/bin/bash us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:2.1.0\n\n# Once inside the container interactively, you can run the tbp-parser tool\npython3 /tbp-parser/tbp_parser/tbp_parser.py -v\n# v2.1.0\n
"},{"location":"usage/#locally-with-python","title":"Locally with Python","text":"
tbp-parser is not yet available with pip or conda. To run tbp-parser in your local command-line environment, install the following dependencies:
python3
pandas >= 1.4.2
importlib_resources
samtools
After installation of these dependencies, download and extract the latest release of tbp-parser and run the script with python3.
The help message printed by tbp-parser is extensive and contains a lot of useful information about the input parameters. Here is the message in full. You can find more information regarding these inputs in the Inputs section.
usage: python3 /tbp-parser/tbp_parser/tbp_parser.py [-h|-v] <input_json> <input_bam> [<args>]\n\nParses Jody Phelon's TB-Profiler JSON output into four files:\n- a Laboratorian report,\n- a LIMS report\n- a Looker report, and\n- a coverage report\n\npositional arguments:\n input_json\n the JSON file produced by TBProfiler\n input_bam\n the BAM file produced by TBProfiler\n\noptional arguments:\n -h, --help\n show this help message and exit\n -v, --version\n show program's version number and exit\n\nquality control arguments:\n options that determine what passes QC\n\n -d, --min_depth\n the minimum depth of coverage for a site to pass QC\n default=10\n -c, --min_percent_coverage\n the minimum percentage of a region that has depth above the threshold set by min_depth\n (used for a gene/locus to pass QC)\n default=100\n -s, --min_read_support\n the minimum read support for a mutation to pass QC\n default=10\n -f, --min_frequency\n the minimum frequency for a mutation to pass QC (0.1 -> 10%)\n default=0.1\n -r, --coverage_regions\n the BED file containing the regions to calculate percent coverage for\n default=data/tbdb-modified-regions.bed\n\ntext arguments:\n arguments that are used verbatim in the reports or to name the output files\n\n -m, --sequencing_method\n the sequencing method used to generate the data; used in the LIMS & Looker reports\n ** Enclose in quotes if includes a space\n default=\"Sequencing method not provided\"\n -p, --operator\n the operator who ran the sequencing; used in the LIMS & Looker reports\n ** Enclose in quotes if includes a space\n default=\"Operator not provided\"\n -o, --output_prefix\n the output file name prefix\n ** Do not include any spaces\n\ntNGS-specific arguments:\n options that are primarily used for tNGS data\n (all frequency arguments are compatible with WGS data)\n\n --tngs\n indicates that the input data was generated using Deeplex + CDPH modified protocol\n Turns on tNGS-specific global parameters\n 
--tngs_expert_regions\n the BED file containing the regions to calculate coverage for expert rule regions\n (used to determine coverage quality in the regions where resistance-conferring\n mutations are found, or where a CDC expert rule is applied; not for QC)\n default=data/tngs-expert-rule-regions.bed\n --rrs_frequency\n the minimum frequency for an rrs mutation to pass QC\n (rrs has several problematic sites in the Deeplex tNGS assay)\n default=0.1\n --rrl_frequency\n the minimum frequency for an rrl mutation to pass QC\n (rrl has several problematic sites in the Deeplex tNGS assay)\n default=0.1\n --rpob449_frequency\n the minimum frequency for an rpoB mutation at protein position 449 to pass QC\n (this is a problematic site in the Deeplex tNGS assay)\n default=0.1\n --etha237_frequency\n the minimum frequency for an ethA mutation at protein position 237 to pass QC\n (this is a problematic site in the Deeplex tNGS assay)\n default=0.1\n\nlogging arguments:\n options that change the verbosity of the stdout log\n\n --verbose\n increase output verbosity\n --debug\n increase output verbosity to debug; overwrites --verbose\n\nPlease contact support@theiagen.com with any questions\n
Resistance calls are made in one of two ways. The first uses the WHO annotation, which is output directly by TBProfiler. The WHO maintains a catalogue of mutations and how they may confer antimicrobial resistance. If this annotation is present, it will always be used.
In the case where the WHO annotation is missing, either due to novel mutations or mutations with unclear significance in the literature, tbp-parser will apply expert rules. These expert rules are additional conditions used to decide if a mutation is considered to confer resistance or not. These expert rules come from the CDC and can be found documented in the tbp-parser GitHub repository inside the interpretation logic PDFs.
When an expert rule is applied, the rationale field of the laboratorian report will indicate which expert rule was used (the number prefacing the rule directly correlates to the appropriate section in the interpretation logic PDF) and indicate that there was no WHO annotation.
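The precedence described above (WHO annotation first, expert rule only as a fallback) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the argument shapes, and the 'who' / 'expert_rule' labels are hypothetical and do not reproduce the actual rationale strings or rule logic in tbp-parser.

```python
def choose_interpretation(who_annotation, expert_rule_result):
    # Hypothetical sketch: the WHO annotation always wins when it is present.
    if who_annotation:
        return {'interpretation': who_annotation, 'source': 'who'}
    # Otherwise fall back to the result of an expert rule (assumed precomputed here).
    return {
        'interpretation': expert_rule_result['interpretation'],
        'source': 'expert_rule',
        'rule_id': expert_rule_result['rule_id'],
    }


with_who = choose_interpretation('R', {'interpretation': 'U', 'rule_id': '2.2.1'})
without_who = choose_interpretation(None, {'interpretation': 'U', 'rule_id': '2.2.1'})
```

In the first call the expert rule result is ignored entirely; in the second, the rule identifier is carried along so that a rationale field could point back to the matching section of the interpretation logic document. The rule identifier shown is a placeholder, not a real section number.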
The interpretation documents for v1.2.2 and v1.4.4.8 are available in the root directory of the tbp-parser repository. Versions that correspond to different releases are available in the interpretation_docs directory on GitHub.
The examples in this document are based on the output of TBProfiler v4.4.2. However, the general principles apply to all versions of TBProfiler and tbp-parser.
tbp-parser is object-oriented, with each class representing either an output file, a part of an output file, or a part of the input JSON file produced by TBProfiler.
The first class that is invoked by the tbp-parser.py script is Parser which is a control class that orchestrates the creation of the different output reports.
Before creating any reports, Parser calls the Coverage class to calculate the percent gene coverage over a specified minimum depth (default: 10) for the coding regions of all genes included in the TBDB (the database used in TBProfiler to generate the drug resistance annotations). This requires as input the BAM and BAI files produced by TBProfiler during alignment to the H37Rv reference genome. The percent gene coverage results are then stored in a global dictionary that is accessed multiple times for QC purposes during the creation of the final reports.
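To make the coverage calculation concrete, here is a minimal sketch of a breadth-of-coverage computation over a list of per-position depths. It assumes the depths for a region have already been extracted from the BAM file; the function name is hypothetical, and treating a depth equal to the threshold as covered is an assumption, since the exact comparison used by tbp-parser is not stated here.

```python
def percent_coverage(depths, min_depth=10):
    # Hypothetical helper: percent of positions in a region meeting the depth threshold.
    if not depths:
        return 0.0
    # Assumption: a position at exactly min_depth counts as covered.
    covered = sum(1 for depth in depths if depth >= min_depth)
    return 100.0 * covered / len(depths)


# Four positions, three of which reach the default threshold of 10.
breadth = percent_coverage([12, 9, 10, 30])

# A per-gene dictionary like this could then be consulted during QC.
coverage_by_gene = {'example_gene': breadth}
```

Storing one value per gene in a dictionary mirrors the description above of a global dictionary that is consulted repeatedly while the reports are built.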
"},{"location":"algorithm/technical/#creating-the-laboratorian-report","title":"Creating the Laboratorian report","text":"
Then, Parser creates the Laboratorian report using the Laboratorian class and its associated .create_laboratorian_report() method.
The Laboratorian class uses the input JSON file to collect the necessary information. The structure of the input JSON file is a good place to start the breakdown:
In this example, we can see only the relevant top-level JSON fields that are used in tbp-parser.
Of interest, the \"id\" column is used to set the global SAMPLE_NAME variable.
The lineage information, found in \"main_lin\" and \"sublin\", is used in the LIMS and Looker reports, so we won\u2019t go into detail about it here.
The variant information is what makes up the bulk of the Laboratorian report and can be found in the \"dr_variants\" and \"other_variants\" fields. We\u2019ll talk more about these fields later.
There are many other fields that are omitted from this example since they are not used in tbp-parser, such as version information and overall sample drug resistance type (like RR-TB, etc.). These fields are found in the other TBProfiler output files in more human-readable formats.
Within the input JSON file, there are two fields that are examined the most: \"dr_variants\" and \"other_variants\". These fields are treated the same, and have the same format, although different mutations are found in both regions. The difference between the two fields is unclear to me at this time. In the example below, only the fields used in tbp-parser are shown.
After the global SAMPLE_NAME variable is set, the Laboratorian class calls the .iterate_section() method, starting with the \"dr_variants\" field.
Since the contents of each variant section in the JSON dictionary are considered a list, we start to iterate through each list item, which consists of each section within curly brackets {...}. In the example to the left, I\u2019ve only included 1 item in each list.
Immediately, each item in the list is converted into a Variant class object, and every item in each list item (the \"chrom\", \"genome_pos\", \"locus_tag\", etc.) is converted to a class attribute. This is because each item in the list represents a single mutation or a single variant. I\u2019ll now refer to each variant section item as a Variant.
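As a rough sketch of this conversion, a dictionary item can be turned into an object whose attributes mirror the JSON keys. The class below is a hypothetical stand-in rather than the actual Variant class, and the example values are illustrative only.

```python
class Variant:
    '''Hypothetical stand-in for one variant entry from the input JSON.'''

    def __init__(self, fields):
        # Every key in the JSON item becomes an attribute of the object
        for key, value in fields.items():
            setattr(self, key, value)


# Illustrative item; a real entry contains more fields
item = {'chrom': 'Chromosome', 'genome_pos': 7362, 'locus_tag': 'Rv0006'}
variant = Variant(item)
```

After construction, values are read as attributes (for example variant.genome_pos) instead of dictionary lookups.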
Each new Variant object has the .extract_annotations() method called. This method starts by iterating through the \"annotation\" field in the input JSON. The annotation field can contain multiple different annotations, so we look at each one individually.
Each annotation is turned into a Row object, which represents a row in the Laboratorian report. During the initiation of the Row object, each column in the Laboratorian report is created based on both the annotation field and the originating Variant object. Additionally, a warning field is created based on both the global dictionary created with the Coverage class and the mutation\u2019s \"depth\" and \"freq\" fields.
Sometimes multiple annotations for the same drug can appear for a single Variant. If this is the case, only the most severe annotation is saved (that is, an annotation that indicates resistance is kept instead of one that indicates susceptibility).
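One way to picture this rule is shown below. The severity ranking and the dictionary keys (drug, call) are assumptions made for illustration; the real annotations carry WHO confidence text rather than single letters.

```python
# Assumed ranking: a larger number means a more severe call
SEVERITY = {'S': 0, 'U': 1, 'R': 2}


def most_severe_per_drug(annotations):
    '''Keep only the most severe annotation for each drug.'''
    kept = {}
    for annotation in annotations:
        drug = annotation['drug']
        current = kept.get(drug)
        if current is None or SEVERITY[annotation['call']] > SEVERITY[current['call']]:
            kept[drug] = annotation
    return kept


annotations = [
    {'drug': 'rifampin', 'call': 'S'},
    {'drug': 'rifampin', 'call': 'R'},
    {'drug': 'isoniazid', 'call': 'U'},
]
kept = most_severe_per_drug(annotations)
```

Here the resistant rifampin annotation replaces the susceptible one, while the single isoniazid annotation is kept as-is.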
After the annotation field has been iterated through, we then check the \"gene_associated_drugs\" field to make sure that we create a Row for each antimicrobial drug that is associated with the gene. As you can see in the \"other_variants\" section, the annotation field for the variant only lists annotations for moxifloxacin and levofloxacin, but the gene is associated with three other antimicrobial drugs. This iteration creates additional Row objects for those antimicrobial drugs.
This means that each mutation can appear several times in the final report, once for every antimicrobial drug associated with the gene. This is because the same mutation can confer different levels of resistance to different drugs.
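The expansion over gene-associated drugs can be sketched like this. The function name, the row layout, and the placeholder drug name drug_c are hypothetical; only the idea of one row per associated drug comes from the description above.

```python
def rows_for_variant(gene_associated_drugs, annotated_drugs):
    '''Build one row per drug associated with the gene of a variant.'''
    rows = []
    for drug in gene_associated_drugs:
        # Record whether the row came from an annotation or only from the gene association
        source = 'annotation' if drug in annotated_drugs else 'gene association'
        rows.append({'antimicrobial': drug, 'source': source})
    return rows


rows = rows_for_variant(
    ['moxifloxacin', 'levofloxacin', 'drug_c'],
    {'moxifloxacin', 'levofloxacin'},
)
```

The third drug has no annotation of its own, yet it still receives a row because the gene is associated with it.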
After Row objects are created for each Variant in the variant section, every Row has the .complete_row() method called, which adds the interpretation columns to the object. Two interpretation columns are created, mdl_interpretation and looker_interpretation.
Please note that these interpretation columns are typically identical, but in several cases, the mdl_interpretation column will call a variant-drug combination as \u201csusceptible\u201d (S), while the looker_interpretation column will call the same combination \u201cuncertain\u201d (U).
In the case where a WHO annotation was not identified, the Variant class\u2019 .apply_expert_rules() method is called. This function applies expert rules that are listed in detail on the tbp-parser GitHub repository, available here.
The expert rules assign a drug resistance call to the variant-drug combination only when there is no WHO annotation and will fill the mdl_interpretation and looker_interpretation fields.
If the mutation is in either mmpS5, mmpL5, or mmpR5/Rv0678, then the \"alternate_consequences\" field is iterated through. This field typically lists the same mutation but in reference to a different gene; for instance, if a mutation is in the upstream non-coding region of one gene, it may be in the coding region of a different gene.
Then, any genes that do not have any variants are added to the laboratorian report with various \u201cNA\u201d or \u201cWT\u201d values filling the appropriate fields.
This means that every gene in the TBDB appears in the Laboratorian report regardless of whether any mutations were identified in that gene.
Finally, a few more quality control measures are taken and then all of the individual Row objects are written to a CSV file, which concludes the creation of the laboratorian report.
"},{"location":"algorithm/technical/#creating-the-looker-report","title":"Creating the Looker report","text":"
The Parser class then creates a Looker object which uses the .create_looker_report() method. The Looker report uses the Laboratorian report to generate most of the included information.
It starts by iterating through a list of antimicrobial drugs and extracting all of the looker_interpretation values for each row in the report with that antimicrobial drug. It then identifies the highest resistance rating (R > R-Interim > U > S-Interim > S) for all resistance annotations for a drug.
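The ranking step can be illustrated with a short sketch. The list order follows the ranking stated above (R > R-Interim > U > S-Interim > S); the function name is hypothetical.

```python
# Ordered from lowest to highest resistance rating
RANKING = ['S', 'S-Interim', 'U', 'R-Interim', 'R']


def highest_rating(interpretations):
    '''Return the highest-ranked interpretation found for one drug.'''
    return max(interpretations, key=RANKING.index)


example = highest_rating(['S', 'U', 'R-Interim', 'S-Interim'])
```

This sketch assumes at least one interpretation exists for the drug; max raises a ValueError on an empty list.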
Then, a quality check is performed: if a gene that contributed to the highest resistance rating fails coverage, an insufficient coverage warning is given.
The \"main_lin\" and \"sublin\" fields from the input JSON file are used to fill the ID field in the report. These fields are converted into shortened English without any technical lineage information.
Finally, the information is written to a CSV file which concludes the creation of the Looker report.
"},{"location":"algorithm/technical/#creating-the-lims-report","title":"Creating the LIMS report","text":"
The Parser class then creates a LIMS object which uses the .create_lims_report() method. The LIMS report also uses the Laboratorian report to generate the bulk of the information included.
The .create_lims_report() method begins by iterating through each LIMS antimicrobial and gene code (corresponding to the LIMS codes in the CDPH STAR LIMS system). Then, as in the Looker report, the highest mdl_interpretation value is extracted from the rows in the report that are associated with that antimicrobial drug. The annotation is then converted into a human-readable format (R \u2192 \u201cMutation(s) associated with resistance to {antimicrobial} detected\u201d, etc.).
Then, the .apply_lims_rules() function is activated which determines which mutations should be output for the corresponding drug-gene combination. The mutations are then formatted so that they appear in the following format: {nucleotide mutation} ({amino acid mutation, if available}) repeated, separated by semicolons.
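The output format described above can be sketched as follows. The function name and the exact separator (a semicolon followed by a space) are assumptions, and the mutations are illustrative examples.

```python
def format_mutations(mutations):
    '''Join (nucleotide, amino acid) pairs into a single LIMS-style string.'''
    parts = []
    for nucleotide, amino_acid in mutations:
        # The amino acid change is added in parentheses only when available
        if amino_acid:
            parts.append(f'{nucleotide} ({amino_acid})')
        else:
            parts.append(nucleotide)
    return '; '.join(parts)


formatted = format_mutations([
    ('c.1349C>T', 'p.Ser450Leu'),
    ('c.-15C>T', None),
])
```

A mutation without an amino acid change, such as one in an upstream region, is written as the nucleotide change alone.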
Some specific parsing rules apply to mutations within the rpoB gene, which changes the output language on the LIMS report. These rules depend on the position of the mutation in the gene.
After the rules are applied and the mutations are collected, the information is written to a CSV file which concludes the creation of the LIMS report.
"},{"location":"algorithm/technical/#creating-the-coverage-report","title":"Creating the coverage report","text":"
The Parser class then reuses the Coverage object created first and calls the .reformat_coverage() method which adds any warnings, such as any deletion mutations detected for a gene. If a deletion is detected, a warning is useful because it indicates that although the reported coverage is less than 100%, it may be due to that deletion. If the coverage is still 100% and a deletion was identified, the warning will say that the deletion may be upstream.
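The warning logic in this step can be pictured with the sketch below. The function name and the warning wording are hypothetical; only the two cases (coverage below 100% with a deletion, and coverage at 100% with a deletion) come from the description above.

```python
def coverage_warning(percent_coverage, has_deletion):
    '''Return a warning string for one gene, or an empty string if none applies.'''
    if not has_deletion:
        return ''
    if percent_coverage < 100:
        # The deletion may explain why coverage is below 100%
        return 'Deletion identified; may explain reduced coverage'
    # Coverage is complete, so the deletion may lie upstream of the gene
    return 'Deletion identified; may be upstream'
```

Genes without a detected deletion get no warning in this sketch, whatever their coverage.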
The coverage dictionary and the associated warnings are then written to a CSV file which concludes the creation of the coverage report, and the tbp-parser script.
The inputs on this page reflect the parameters that are applicable for the command-line tool. To see the inputs required for tbp-parser when run as part of the TheiaProk workflow series, please refer to the TheiaProk Inputs page.
tbp-parser is designed to run immediately after Jody Phelan\u2019s TBProfiler tool. Only two inputs are required: the JSON file produced by TBProfiler and the BAM file produced by TBProfiler.
The JSON file contains information about the mutations detected in the sample: the quality, the type, and if that mutation confers resistance to an antimicrobial drug. The BAM file contains the alignment information for the sample and is needed for determining sequencing quality.
| Parameter | Description |
|---|---|
| input_json | The path to the JSON file that was produced by TBProfiler |
| input_bam | The path to the BAM file that was produced by TBProfiler |
BAM index file required
The BAM file must have the accompanying BAI file in the same directory. It must also be named exactly the same as the BAM file but ending with a .bai suffix.
tbp-parser can be customized with a number of optional input parameters. These parameters can be used to control the quality control thresholds, the text that appears in the reports, and the names of the output files. The following is a list of all the input parameters that can be used with tbp-parser.
In addition to these arguments, tbp-parser also has a -h, --help argument that prints the list of possible arguments and their descriptions, and a -v, --version argument that prints the version of tbp-parser that is installed. Both of these commands exit the program after printing their output.
"},{"location":"inputs/inputs/#quality-control-arguments","title":"Quality Control Arguments","text":"
These options determine the thresholds for quality control.
| Short Version | Long Version | Description | Default Value |
|---|---|---|---|
| -d | --min_depth | The minimum depth of coverage required for a site to pass QC | 10 |
| -c | --min_percent_coverage | The minimum percentage of a region that has depth above the threshold set by min_depth (used for a gene/locus to pass QC) | 100 |
| -s | --min_read_support | The minimum read support for a mutation to pass QC | 10 |
| -f | --min_frequency | The minimum frequency for a mutation to pass QC (0.1 -> 10%) | 0.1 |
| -r | --coverage_regions | A BED file containing the regions to calculate percent coverage for | /data/tbdb-modified-regions.bed |
"},{"location":"inputs/inputs/#text-arguments","title":"Text Arguments","text":"
These options are used verbatim in the reports, or are used to name the output files.
| Short Version | Long Version | Description | Default Value |
|---|---|---|---|
| -m | --sequencing_method | The sequencing method used to generate the data; used in the LIMS & Looker reports. Enclose in quotes if including a space | \"Sequencing method not provided\" |
| -p | --operator | The operator who ran the analysis; used in the LIMS & Looker reports. Enclose in quotes if including a space | \"Operator not provided\" |
| -o | --output_prefix | The prefix to use for the output files. Do not include any spaces | \"tbp-parser\" |
"},{"location":"inputs/inputs/#lims-arguments","title":"LIMS Arguments","text":"
These options are used to customize the LIMS report
| Name | Description | Default Value |
|---|---|---|
| --add_cs_lims | Adds Cycloserine (CS) fields to the LIMS report | false |
"},{"location":"inputs/inputs/#tngs-specific-arguments","title":"tNGS-specific Arguments","text":"
These options are primarily used for tNGS data, although all frequency and read support arguments are compatible with WGS data.
| Name | Description | Default Value |
|---|---|---|
| --tngs | Indicates that the input data was generated using the Deeplex + CDPH modified protocol. Turns on tNGS-specific global parameters | false |
| --tngs_expert_regions | A BED file containing the regions to calculate coverage for expert rule regions. This is used to determine coverage quality in the regions where resistance-conferring mutations are found, or where a CDC expert rule is applied. This is not used for QC purposes | /data/tbdb-expert-regions.bed |
| --rrs_frequency | The minimum frequency for an rrs mutation to pass QC, as rrs has several problematic sites in the Deeplex tNGS assay | 0.1 |
| --rrl_frequency | The minimum frequency for an rrl mutation to pass QC, as rrl has several problematic sites in the Deeplex tNGS assay | 0.1 |
| --rrs_read_support | The minimum read support for an rrs mutation to pass QC, as rrs has several problematic sites in the Deeplex tNGS assay | 10 |
| --rrl_read_support | The minimum read support for an rrl mutation to pass QC, as rrl has several problematic sites in the Deeplex tNGS assay | 10 |
| --rpob449_frequency | The minimum frequency for an rpoB mutation at protein position 449 to pass QC, as this site is problematic in the Deeplex tNGS assay | 0.1 |
| --etha237_frequency | The minimum frequency for an ethA mutation at protein position 237 to pass QC, as this site is problematic in the Deeplex tNGS assay | 0.1 |
"},{"location":"inputs/inputs/#logging-arguments","title":"Logging Arguments","text":"
These options change the verbosity of the stdout log
| Name | Description | Default Value |
|---|---|---|
| --verbose | Increases the output verbosity to describe which stage of the analysis is currently running | false |
| --debug | The highest level of output verbosity detailing every step of the analysis and logic implemented; overwrites --verbose | false |
"},{"location":"inputs/theiaprok/","title":"TheiaProk Inputs on Terra","text":"
When running tbp-parser as part of the TheiaProk workflow series (find documentation for TheiaProk here) on Terra.bio, an optional input must be activated to instruct TheiaProk to run tbp-parser.
tbp-parser is not on by default due to the nature of this tool and its outputs.
TheiaProk Version
This information only corresponds to the upcoming PHB v2.3.0 release. These inputs and outputs may not be applicable to other versions of TheiaProk.
To activate tbp-parser you must set the following variable to true:
| Terra Task name | Variable | Type | Description | Default Value |
|---|---|---|---|---|
| merlin_magic | call_tbp_parser | Boolean | Set to true to activate tbp-parser | false |
"},{"location":"inputs/theiaprok/#optional-inputs","title":"Optional Inputs","text":"
The following optional inputs are also available for user modification on Terra:
| Terra Task name | Variable | Type | Description | Default Value |
|---|---|---|---|---|
| merlin_magic | tbp_parser_add_cs_lims | Boolean | Set to true to add Cycloserine (CS) fields to the LIMS report | false |
| merlin_magic | tbp_parser_coverage_regions_bed | File | A BED file containing the regions to calculate percent coverage for | tbdb-modified-regions.bed |
| merlin_magic | tbp_parser_coverage_threshold | Int | The minimum percentage of a region that has depth above the threshold set by min_depth (used for a gene/locus to pass QC) | 100 |
| merlin_magic | tbp_parser_debug | Boolean | Set to false to turn off debug mode for tbp-parser | true |
| merlin_magic | tbp_parser_docker_image | String | The Docker image to use when running tbp-parser | \"us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:2.1.0\" |
| merlin_magic | tbp_parser_etha237_frequency | Float | Minimum frequency for a mutation in ethA at protein position 237 to pass QC in tbp-parser | 0.1 |
| merlin_magic | tbp_parser_expert_rule_regions_bed | File | A file that contains the regions where R mutations and expert rules are applied | |
| merlin_magic | tbp_parser_min_depth | Int | Minimum depth for a variant to pass QC in tbp_parser | 10 |
| merlin_magic | tbp_parser_min_frequency | Int | The minimum frequency for a mutation to pass QC | 0.1 |
| merlin_magic | tbp_parser_min_read_support | Int | The minimum read support for a mutation to pass QC | 10 |
| merlin_magic | tbp_parser_operator | String | Fills the \"operator\" field in the tbp_parser output files | \"Operator not provided\" |
| merlin_magic | tbp_parser_output_seq_method_type | String | Fills out the \"seq_method\" field in the tbp_parser output files | \"Sequencing method not provided\" |
| merlin_magic | tbp_parser_rpob449_frequency | Float | Minimum frequency for a mutation at protein position 449 to pass QC in tbp-parser | 0.1 |
| merlin_magic | tbp_parser_rrl_frequency | Float | Minimum frequency for a mutation in rrl to pass QC in tbp-parser | 0.1 |
| merlin_magic | tbp_parser_rrl_read_support | Int | Minimum read support for a mutation in rrl to pass QC in tbp-parser | 10 |
| merlin_magic | tbp_parser_rrs_frequency | Float | Minimum frequency for a mutation in rrs to pass QC in tbp-parser | 0.1 |
| merlin_magic | tbp_parser_rrs_read_support | Int | Minimum read support for a mutation in rrs to pass QC in tbp-parser | 10 |
| merlin_magic | tbp_parser_tngs_data | Boolean | Set to true to enable tNGS-specific parameters and runs in tbp-parser | false |
Find the outputs for tbp-parser in TheiaProk on Terra here.
tbp-parser produces four files as outputs. See each individual page for more details on how they are constructed and what they contain:
Laboratorian report
LIMS report
Looker report
Coverage report
The four reports contain a wealth of information. Ordered from most to least verbose, they are: the laboratorian report, the LIMS report, the Looker report, and the coverage report. The same information is used in all four reports but at differing levels of verbosity.
Running tbp-parser as part of TheiaProk on Terra produces additional outputs. You can find that information in the TheiaProk Outputs on Terra page.
The coverage report lists every gene and its percent gene coverage over a minimum depth (default: 10) relative to the H37Rv genome.
Please note that user-provided coverage regions always take precedence over default values.
"},{"location":"outputs/coverage/#wgs-coverage-report","title":"WGS Coverage Report","text":"Column name Explanation Gene The name of the gene or locus Percent_Coverage The percent of the gene\u2019s coding region that has a read depth over the minimum value (default: 10; user-customizable by altering --min_depth) Warning Indicates if any deletions were identified in the gene which may contribute to lower than expected coverage
If run using the TheiaProk workflow series, there will be an additional column that contains only the name of the sample, which is useful when concatenating many reports as it helps differentiate which gene belongs to which sample.
If the --tngs flag is used, the report contains the following fields:
| Column name | Explanation |
|---|---|
| Gene | The name of the gene or locus |
| Coverage_Breadth_reportableQC_region | The percent of the gene (positions determined by the regions covered by the tNGS Deeplex + CDPH assay primers that are considered reportable by CDPH) that is covered at a depth greater than the --min_depth value |
| QC_Warning | Indicates if any deletions were identified in the gene which may contribute to lower than expected coverage |
| Coverage_Breadth_R_expert-rule_region | The percent of the regions (positions that could contain any resistance-conferring mutations or require expert-rule application) that is covered at a depth greater than the --min_depth value |
Coverage regions are determined with either the default /data/tbdb-modified-regions.bed (collected on Sep 1, 2023 from the TBProfiler repository) or, if --tngs is used, /data/tngs-reportable-regions.bed.
The R-expert rule region is determined only if --tngs is indicated and uses the ranges in /data/tbdb-expert-regions.bed.
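The choice of coverage regions can be summarized in a small sketch. The function name and argument names are hypothetical; the file paths and the precedence of user-provided regions are taken from the text above.

```python
def pick_coverage_regions(user_bed=None, tngs=False):
    '''Return the BED file path used for coverage calculations.'''
    if user_bed:
        # User-provided coverage regions always take precedence
        return user_bed
    if tngs:
        return '/data/tngs-reportable-regions.bed'
    return '/data/tbdb-modified-regions.bed'
```

For example, pick_coverage_regions(tngs=True) selects the tNGS reportable regions, while passing any user_bed path overrides both defaults.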
The laboratorian report is the main report produced by tbp-parser and is used to generate all of the other reports. What follows is an explanation of all the columns in the report.
"},{"location":"outputs/laboratorian/#explanation-of-column-headers","title":"Explanation of column headers","text":"Column name Explanation sample_id The name of the sample tbprofiler_gene_name The name of the gene where the mutation has been identified tbprofiler_locus_tag The locus tag for the mutation that has been identified tbprofiler_variant_substitution_type The type of mutation identified, whether or not it was a frameshift, missense, or synonymous mutation tbprofiler_variant_substitution_nt The mutation in nucleotide format tbprofiler_variant_substitution_aa The mutation in amino acid format, if possible confidence Contains either:- the WHO annotation- an indication that there was no WHO annotation- NA for when there is no mutation antimicrobial The antimicrobial drug that may be affected by this mutation looker_interpretation The drug resistance interpretation intended for the Looker report mdl_interpretation The drug resistance interpretation intended for the LIMS report depth The depth of coverage at the mutation frequency The frequency of the mutation in the reads read_support How many reads support the mutation (depth * frequency) rationale Contains an indication of what was used (the WHO annotation, the specific expert rule used, or neither) to create the two interpretations warning Any potential quality warnings that may indicate lower reliability gene_tier The gene tier of the mutation\u2019s gene (Tier 1, Tier 2, or NA)
Because of how a particular mutation may contribute resistance to different drugs at the same time, each mutation is listed multiple times, once for each antimicrobial drug that could be affected. In addition, any genes that do not have any mutations are also included in the laboratorian report with NA or WT in the appropriate field. This results in a report with many rows and often, rows with very similar values. However, the laboratorian report contains the \u201ccomplete picture\u201d of the sample and is incredibly useful for understanding the sample\u2019s drug resistance profile.
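The read_support column and the QC thresholds can be related with a short sketch. The function name and the way the three checks are combined are assumptions; the defaults mirror the documented values for --min_depth, --min_frequency, and --min_read_support, and read support is computed as depth * frequency as stated in the column table.

```python
def passes_qc(depth, frequency, min_depth=10, min_frequency=0.1, min_read_support=10):
    '''Return True when a mutation meets all three assumed QC thresholds.'''
    # Read support is the number of reads carrying the mutation
    read_support = depth * frequency
    return (
        depth >= min_depth
        and frequency >= min_frequency
        and read_support >= min_read_support
    )
```

With the defaults, a mutation at depth 50 and frequency 0.5 has a read support of 25 and passes, while one at depth 20 and frequency 0.2 has a read support of 4 and fails.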
The LIMS report is intended for direct import into a STAR LIMS system. The columns are in the specific LIMS code format for CDPH, and may not apply to your LIMS system. Please contact us if you need different column headers and we can work with you towards a solution.
"},{"location":"outputs/lims/#explanation-of-column-headers","title":"Explanation of column headers","text":"Column name Explanation MDL sample accession numbers The name of the sample M_DST_A01_ID The lineage of the sample in human-readable language M_DST_B01_INH The highest mdl_interpretation resistance identified for mutations associated with this drug (isoniazid) M_DST_B02_katG Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethionamideresponsible for the predicted resistance for isoniazid M_DST_B03_fabG1 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for isoniazid M_DST_B04_inhA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for isoniazid M_DST_C01_ETO The highest mdl_interpretation resistance identified for mutations associated with this drug (ethionamide) M_DST_C02_ethA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethionamide M_DST_C03_fabG1 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethionamide M_DST_C04_inhA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethionamide M_DST_D01_RIF The highest mdl_interpretation resistance identified for mutations associated with this drug (rifampin) M_DST_D02_rpoB Any non-S mutations found in this gene with good quality responsible for the predicted resistance for rifampin M_DST_E01_PZA The highest mdl_interpretation resistance identified for mutations associated with this drug (pyrazinamide) M_DST_E02_pncA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for pyrazinamide M_DST_F01_EMB The highest mdl_interpretation resistance identified for mutations associated with this drug (ethambutol) M_DST_F02_embA Any non-S mutations found in this gene with good 
quality responsible for the predicted resistance for ethambutol M_DST_F03_embB Any non-S mutations found in this gene with good quality responsible for the predicted resistance for ethambutol M_DST_G01_AMK The highest mdl_interpretation resistance identified for mutations associated with this drug (amikacin) M_DST_G02_rrs Any non-S mutations found in this gene with good quality responsible for the predicted resistance for amikacin M_DST_G03_eis Any non-S mutations found in this gene with good quality responsible for the predicted resistance for amikacin M_DST_H01_KAN The highest mdl_interpretation resistance identified for mutations associated with this drug (kanamycin) M_DST_H02_rrs Any non-S mutations found in this gene with good quality responsible for the predicted resistance for kanamycin M_DST_H03_eis Any non-S mutations found in this gene with good quality responsible for the predicted resistance for kanamycin M_DST_I01_CAP The highest mdl_interpretation resistance identified for mutations associated with this drug (capreomycin) M_DST_I02_rrs Any non-S mutations found in this gene with good quality responsible for the predicted resistance for capreomycin M_DST_I03_tlyA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for capreomycin M_DST_J01_MFX The highest mdl_interpretation resistance identified for mutations associated with this drug (moxifloxacin) M_DST_J02_gyrA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for moxifloxacin M_DST_J03_gyrB Any non-S mutations found in this gene with good quality responsible for the predicted resistance for moxifloxacin M_DST_K01_LFX The highest mdl_interpretation resistance identified for mutations associated with this drug (levofloxacin) M_DST_K02_gyrA Any non-S mutations found in this gene with good quality responsible for the predicted resistance for levofloxacin M_DST_K03_gyrB Any non-S mutations found in this gene 
with good quality responsible for the predicted resistance for levofloxacin M_DST_L01_BDQ The highest mdl_interpretation resistance identified for mutations associated with this drug (bedaquiline) M_DST_L02_Rv0678 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_L03_atpE Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_L04_pepQ Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_L05_mmpL5 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_L06_mmpS5 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for bedaquiline M_DST_M01_CFZ The highest mdl_interpretation resistance identified for mutations associated with this drug (clofazimine) M_DST_M02_Rv0678 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for clofazimine M_DST_M03_pepQ Any non-S mutations found in this gene with good quality responsible for the predicted resistance for clofazimine M_DST_M04_mmpL5 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for clofazimine M_DST_M05_mmpS5 Any non-S mutations found in this gene with good quality responsible for the predicted resistance for clofazimine M_DST_N01_LZD The highest mdl_interpretation resistance identified for mutations associated with this drug (linezolid) M_DST_N02_rrl Any non-S mutations found in this gene with good quality responsible for the predicted resistance for linezolid M_DST_N03_rplC Any non-S mutations found in this gene with good quality responsible for the predicted resistance for linezolid Analysis date The date tbp-parser was run in YYYY-MM-DD HH:SS format Operator The name of the person who ran tbp-parser; can be provided 
with the --operator input parameter. If left blank, \u201cOperator not provided\u201d is the default value. M_DST_O01_lineage The lineage of the sample (the main_lin of the sample as reported by TBProfiler) M_DST_P01_CS The highest mdl_interpretation resistance identified for mutations associated with this drug (cycloserine); only included when --add_cs_lims is set to true M_DST_P02_ald Any non-S mutations found in this gene with good quality responsible for the predicted resistance for cycloserine; only included when --add_cs_lims is set to true M_DST_PO3_alr Any non-S mutations found in this gene with good quality responsible for the predicted resistance for cycloserine; only included when --add_cs_lims is set to true
The LIMS report offers a condensed version of the laboratorian report with more details than the Looker report. By containing only the most important information about a drug and its related mutations, the LIMS report provides an invaluable summary.
The Looker report is intended for use in Google's Looker Studio (formerly Data Studio) for dashboarding purposes. It offers a highly condensed version of the resistance calls (using the looker_interpretation field from the laboratorian report) for a quick summary of the sample\u2019s drug resistance profile.
"},{"location":"outputs/looker/#explanation-of-column-headers","title":"Explanation of column headers","text":"Column name Explanation sample_id The name of the sample output_seq_method_type The sequencing method used to generate the data; can be set with the --sequencing_method input parameter. If left blank, \u201cSequencing method not provided\u201d is the default value amikacin The highest looker_interpretation resistance identified for mutations associated with this drug bedaquiline The highest looker_interpretation resistance identified for mutations associated with this drug capreomycin The highest looker_interpretation resistance identified for mutations associated with this drug clofazimine The highest looker_interpretation resistance identified for mutations associated with this drug ethambutol The highest looker_interpretation resistance identified for mutations associated with this drug ethionamide The highest looker_interpretation resistance identified for mutations associated with this drug isoniazid The highest looker_interpretation resistance identified for mutations associated with this drug kanamycin The highest looker_interpretation resistance identified for mutations associated with this drug levofloxacin The highest looker_interpretation resistance identified for mutations associated with this drug linezolid The highest looker_interpretation resistance identified for mutations associated with this drug moxifloxacin The highest looker_interpretation resistance identified for mutations associated with this drug pyrazinamide The highest looker_interpretation resistance identified for mutations associated with this drug rifampin The highest looker_interpretation resistance identified for mutations associated with this drug streptomycin The highest looker_interpretation resistance identified for mutations associated with this drug lineage The lineage of the sample (the main_lin field as reported by TBProfiler); for example, lineage1.2.1.2.1 ID The 
lineage of the sample in human-readable language (the same as M_DST_A01_ID in the LIMS report) analysis_date The date tbp-parser was run in YYYY-MM-DD HH:SS format operator The name of the person who ran tbp-parser; can be provided with the --operator input parameter. If left blank, \u201cOperator not provided\u201d is the default value.
Please note that the looker_interpretation field can differ from the mdl_interpretation field. They are typically identical, but in some cases the mdl_interpretation column will call a variant-drug combination \u201csusceptible\u201d (S), while the looker_interpretation column will call the same combination \u201cuncertain\u201d (U). Be aware of this difference when choosing an interpretation to report.
"},{"location":"outputs/theiaprok/","title":"TheiaProk Outputs on Terra","text":"
When running tbp-parser as part of the TheiaProk workflow series (find documentation for TheiaProk here) on Terra.bio, you will find the following outputs in your data table.
TheiaProk Version
This information only corresponds to the upcoming PHB v2.3.0 release. These inputs and outputs may not be applicable to other versions of TheiaProk.
| Variable | Type | Description |
|---|---|---|
| tbp_parser_average_genome_depth | Float | The average depth of coverage across the H37Rv reference genome |
| tbp_parser_coverage_report | File | The coverage report generated by tbp-parser |
| tbp_parser_docker | String | The Docker image used to run tbp-parser |
| tbp_parser_genome_percent_coverage | Float | The percentage of the H37Rv reference genome that has depth above the threshold set by tbp_parser_min_depth |
| tbp_parser_laboratorian_report_csv | File | The laboratorian report generated by tbp-parser |
| tbp_parser_lims_report_csv | File | The LIMS report generated by tbp-parser |
| tbp_parser_looker_report_csv | File | The Looker report generated by tbp-parser |
| tbp_parser_version | String | The version of tbp-parser used in the analysis as determined by tbp-parser --version |
Find the inputs for tbp-parser in TheiaProk on Terra here.
"},{"location":"versioning/","title":"Versioning and Releases","text":""},{"location":"versioning/#validated-versions","title":"Validated Versions","text":"
The California Department of Public Health has clinically validated the following versions:
v1.2.2 for WGS, and
v1.4.4.8 for tNGS
Validate Before Use
CAUTION: The information produced by this program should not be used for clinical reporting unless and until extensive validation has occurred in your laboratory on a stable version. Otherwise, the outputs of tbp-parser are for research use only.
For more information on the differences between versions, you can see the Brief Description of Versions or the Exhaustive List of Versions.
"},{"location":"versioning/brief/","title":"Brief Description of Versions","text":"
You may notice there are many releases; tbp-parser is in active development and each release is \"use at your own risk.\" We highly recommend upgrading to the latest release, as it includes important bug fixes. To help track the changes, we have included a brief description of each release:
v1.2.x & below - the initial developmental stages of tbp-parser for WGS data
v1.3.x - adds tNGS data parsing and includes some updates applicable to WGS parsing
v1.4.x - reworks how QC is performed (changes in order of operations)
v1.4.3+ - changes how tNGS lineage determination is performed
v1.4.4+ - changes how nonsynonymous mutations are interpreted; major interpretation differences from earlier versions
v1.6.x - only considers the genes included in the LIMS report to determine the drug output in the LIMS report
v1.5.x+ and v2.0.0 - major changes to the code due to using results from TBProfiler v6.2.0+
v2.1.0 - v1.6.0 and earlier versions are no longer supported; v2.1+ changes are included on the main branch moving forward.
For a more exhaustive list, please visit the Exhaustive List of Versions.
"},{"location":"versioning/exhaustive/","title":"Exhaustive version descriptions","text":"
The following is a list of every version of tbp-parser and a short summary of the changes made in each version.
Blue indicates that CDPH performed a clinical validation on that version
v1.0.0 - initial version
v1.1.0 - adjusts the highest interpretation for a drug to only consider genes in LIMS report, adds the rule to the confidence column, adds QRDR expert rules for gyrA and gyrB
v1.1.1 - fixes a bug in R/QRDR region calculations
v1.1.2 - adjusts LIMS lineage designation by checking for BCG and if lineage from TB Profiler is empty
v1.1.3 - now includes the TB Profiler sublineage output when determining BCG M bovis
v1.1.4 - now checks if multiple lineages/sublineages were detected
v1.1.5 - checks all mmpS/mmpL/mmpR alternate consequences; also checks to make sure all drugs are reported
v1.1.5.1 - renames rifampicin to rifampin
v1.1.6 - removes a locus warning with deletion caveat
v1.1.7 - ensures all deletion caveat locus warnings are gone, overwrites all fields with locus warning with \u201cNA\u201d or \u201cInsufficient Coverage\u201d as appropriate and moves them to the bottom of the Laboratorian report
v1.1.8 - changes overwrite to only overwrite interpretation values, not mutation information
v1.1.9 - renames rifampicin to rifampin
v1.2.0 - enables ability to provide alternate coverage bed file; introduced the modified regions (just coding region + 30bp upstream or promoter region)
v1.2.1 - fixes a bug when renaming rifampicin to rifampin
v1.2.2 (WGS) - improve how maximum MDL interpretation is calculated for the LIMS report. Use the smw-tb-2024-01-16-dev branch on Terra.
v1.2.3 - check only the LIMS genes\u2019 coverage for LIMS lineage determination and use a threshold for all lineage designation
v1.3.0 - adds tNGS regions, checks to make sure that only variants for genes in the coverage report are included in the laboratorian (tNGS), error-proof locus tag designation, add check to prevent failures when gene not in coverage dictionary (tNGS), adds \u201cNA\u201d to the mutation rank list (score = 0, same as Insufficient Coverage)
v1.3.1 - adds --tngs flag to turn on tNGS-specific global parameters, establishes different threshold calculation for lineage designation for tNGS, checks the segment of a gene a variant was detected in, removes check that did not prevent failures when gene not in coverage dictionary from v1.3.0, error-proof all coverage checks, adds \u201cThis mutation is outside the expected region\u201d warning
v1.3.2 - error-proofs coverage warning and adds additional section for tNGS gene segments, error-proofs gene tier for tNGS gene segments
v1.3.3 - condenses most gene segments into one, for WT mutations, set the mutation to \u201cWT\u201d not \u201cNA\u201d
v1.3.4 - error-proofs maximum mdl interpretation determination and maximum looker interpretation determination
v1.3.5 - adds rrs & rrl frequency input parameters to customize mutation frequency for those genes, overwrites gene MDL interpretation when \u201cInsufficient Coverage\u201d to act as if \u201cWT\u201d if greater than S
v1.3.6 - adds the TBProfiler lineage to the end of the LIMS report and the Looker report, adds LIMS lineage to Looker report, introduces check if max MDL interpretation is also Insufficient Coverage to change output to Pending Retest
v1.3.7 - add to the coverage report the \u201cexpert rule regions\u201d column for tNGS, overwrites gene MDL interpretation when \u201cInsufficient Coverage\u201d to act as if \u201cWT\u201d if greater than or equal to S
v1.3.8 - add frequency input parameters for rpoB 449 and ethA 237, renames coverage threshold to minimum percent coverage
v1.3.9 - check if gene name is rpoB because that means it\u2019s outside the expected region (tNGS - rpoB is in two segments), add rrs and rrl read support input parameters
v1.4.0 - rework how QC is performed (order of operations)
v1.4.1 - remove rpoB expected region check, implements deletion position quality check in QC (keep only valid deletions), if outside expected region warning, set MDL interpretations to NA
v1.4.2.1 (same change in v1.5.4) - prevent overwriting \u201cR\u201d mutations with No Sequence, and overwrite \u201cU\u201d mutations with \u201cPending Retest\u201d if bad quality
v1.4.3 - implement different thresholds for LIMS lineage identification for tNGS
v1.4.4 - update expert rule interpretations (mainly S \u2192 U in several spots)
v1.4.4.1 (v1.5.0 branched off of this one) - update LIMS threshold to 90, not the coverage threshold
v1.4.4.2 (same change in v1.5.1) - fix an issue where \u201cNo sequence\u201d was not triggering Pending Retest
v1.4.4.3 (same change in v1.5.5) - fix an issue where \u201cPending Retest\u201d was not properly appearing
v1.4.4.4 (same change in v1.5.6) - prevent \u201cPending Retest\u201d if Insufficient Coverage is in a gene that also has a valid deletion
v1.4.4.5 - consider deletions invalid if coverage is between 0 and minimum coverage (10 default) (this consideration is unique to old TB Profiler and not mimicked in v1.5)
v1.4.4.6 - a mistake; updates the version (this release is a mystery to me as there is nothing in there except version update)
v1.4.4.7 (same change in v1.5.8) - change tNGS LIMS lineage designation to items in the coverage dictionary (to represent both rpoB segments)
v1.4.4.8 (tNGS) (same change in v1.5.9) - reduce tNGS LIMS threshold to 70% from 90%. Use the smw-tb-2024-05-03-dev branch on Terra for this and all subsequent v1.4.4.x+ versions.
v1.4.4.9 (same change in v1.5.7) - add optional input to add cycloserine to LIMS report
v1.4.4.10 - fix issue when MDL resistance was being overwritten to Pending Retest but without considering other genes when calculating the highest MDL resistance (as the other genes may have had higher resistances that were not captured at first)
v1.4.4.11 - fix issue introduced by last fix where we ran into indexing errors due to no more MDL interpretations available in the list
v1.5.0 (branched off of v1.4.4.1) - make all language changes necessary to be compatible with TBProfiler v6.2.1. Use the smw-tb-2024-05-03-who2-dev branch on Terra for this and all subsequent v1.5.x+ versions.
v1.5.1 (same change in v1.4.4.2) - fix an issue where \u201cNo sequence\u201d was not triggering Pending Retest
v1.5.2 - a mistake; somehow exactly the same as 1.4.4.2?? (this release is also a mystery)
v1.5.3 - make additional language changes and fix an unusual edge case where the same mutation was identified; rename mmpR5 to Rv0678 again
v1.5.4 (same change in v1.4.2.1) - prevent overwriting \u201cR\u201d mutations with No Sequence
v1.5.5 (same change in v1.4.4.3) - fix an issue where \u201cPending Retest\u201d was not properly appearing; consider only LIMS genes for LIMS report
v1.5.6 (same change in v1.4.4.4) - prevent \u201cPending Retest\u201d if Insufficient Coverage is in a gene that also has a valid deletion
v1.5.7 (same change in v1.4.4.9) - add optional input to add cycloserine to LIMS report
v1.5.8 (same change in v1.4.4.7) - change tNGS LIMS lineage designation to check items in the coverage dictionary (to represent both rpoB segments; percentage calculation erroneously combined them)
v1.5.9 (same change in v1.4.4.8) - reduce tNGS LIMS threshold to 70% from 90
v1.5.10 - correct spelling of two genes in the LIMS report for cycloserine
v1.6.0 (branched off of v1.4.4.11) - ensures that only LIMS genes are being considered for the LIMS report. Use the smw-tb-2024-05-03-dev branch on Terra for this and all subsequent v1.6.x+ versions.
v2.0.0 (branched off of v1.5.10; same change in v1.4.4.10 and v1.4.4.11) - fix issue when MDL resistance was being overwritten to Pending Retest but without considering other genes when calculating the highest MDL resistance (as the other genes may have had higher resistances that were not captured at first) and fixes the resulting issue where indexing errors occurred\u00a0due to no more MDL interpretations. Use the smw-tb-2024-05-03-who2-dev branch on Terra for this and all subsequent v2.x+ versions.
v2.1.0 - any mutations in the 60 proximal promoter regions included in the WHO v2 database (Table 22, page 89-90). Use either the smw-tbprofiler-updates-dev branch until the time of the v2.3.0 release of TheiaProk on Terra for this and all subsequent v2.1.x+ versions
Earlier versions are now deprecated and will no longer be supported.
The following diagram shows how each version is related to the others without technical details:
"}]}
\ No newline at end of file
diff --git a/v2.2.1/usage/index.html b/v2.2.1/usage/index.html
index 801ce9a..6fcc486 100644
--- a/v2.2.1/usage/index.html
+++ b/v2.2.1/usage/index.html
@@ -1035,16 +1035,17 @@
We host our Docker images on the Google Artifact Registry so that they are always available for usage.
-
The entrypoint for this Docker iamge is the tbp-parser help message. To run this container interactively, use the following command:
-
docker run -it --entrypoint=/bin/bash us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:1.6.0
-# Once inside the container interactively, you can run the tbp-parser tool
-python3 /tbp-parser/tbp_parser/tbp_parser.py -v
-# v1.6.0
+
The entrypoint for this Docker image is the tbp-parser help message. To run this container interactively, you can use the following command:
+
docker run -it --entrypoint=/bin/bash us-docker.pkg.dev/general-theiagen/theiagen/tbp-parser:2.1.0
+
+# Once inside the container interactively, you can run the tbp-parser tool
+python3 /tbp-parser/tbp_parser/tbp_parser.py -v
+# v2.1.0
This shows how the script can be run if used inside the Docker container provided above.
@@ -1067,7 +1068,7 @@
Example Usage --sequencing_method "Illumina NextSeq" \
--operator "John Doe"
-
Please note that the BAM file must have the accompanying BAI file in the same directory. It must also be named exactly the same as the BAM file but ending with a .bai suffix.
+
Please note that the BAM file must have the accompanying BAI file in the same directory.
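Before launching tbp-parser, it can help to confirm that the index is actually sitting next to the BAM file. The following is a minimal sketch, not part of tbp-parser itself: the helper name `find_bam_index` is hypothetical, and the two naming patterns it checks (`sample.bam.bai` and `sample.bai`) are assumptions about common indexing conventions rather than a statement of what tbp-parser accepts.

```python
from pathlib import Path
from typing import Optional


def find_bam_index(bam_path: str) -> Optional[Path]:
    """Return the BAI file found in the same directory as the BAM, or None.

    Hypothetical helper for illustration only; tbp-parser does not ship it.
    """
    bam = Path(bam_path)
    # Assumed naming conventions: "sample.bam.bai" and "sample.bai".
    candidates = [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

If the helper returns `None`, the BAM file likely needs to be indexed (or the index moved into the same directory) before running tbp-parser.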
The help message printed by tbp-parser is quite extensive, but has a lot of useful information regarding the input parameters. Here is the entire message in full. You can find more information regarding these inputs in the Inputs section.
v1.3.3 - condenses most gene segments into one, for WT mutations, set the mutation to “WT” not “NA”
v1.3.4 - error-proofs maximum mdl interpretation determination and maximum looker interpretation determination
v1.3.5 - adds rrs & rrl frequency input parameters to customize mutation frequency for those genes, overwrites gene MDL interpretation when “Insufficient Coverage” to act as if “WT” if greater than S
-
v1.3.6 - adds the TB-Profiler lineage to the end of the LIMS report and the Looker report, adds LIMS lineage to Looker report, introduces check if max MDL interpretation is also Insufficient Coverage to change output to Pending Retest
+
v1.3.6 - adds the TBProfiler lineage to the end of the LIMS report and the Looker report, adds LIMS lineage to Looker report, introduces check if max MDL interpretation is also Insufficient Coverage to change output to Pending Retest
v1.3.7 - add to the coverage report the “expert rule regions” column for tNGS, overwrites gene MDL interpretation when “Insufficient Coverage” to act as if “WT” if greater than or equal to S
v1.3.8 - add frequency input parameters for rpoB 449 and ethA 237, renames coverage threshold to minimum percent coverage
v1.3.9 - check if gene name is rpoB because that means it’s outside the expected region (tNGS - rpoB is in two segments), add rrs and rrl read support input parameters
@@ -997,6 +997,10 @@
Exhaustive List of Versions
v1.5.10 - correct spelling of two genes in the LIMS report for cycloserine
v1.6.0 (branched off of v1.4.4.11) - ensures that only LIMS genes are being considered for the LIMS report. Use the smw-tb-2024-05-03-dev branch on Terra for this and all subsequent v1.6.x+ versions.
v2.0.0 (branched off of v1.5.10; same change in v1.4.4.10 and v1.4.4.11) - fix issue when MDL resistance was being overwritten to Pending Retest but without considering other genes when calculating the highest MDL resistance (as the other genes may have had higher resistances that were not captured at first) and fixes the resulting issue where indexing errors occurred due to no more MDL interpretations. Use the smw-tb-2024-05-03-who2-dev branch on Terra for this and all subsequent v2.x+ versions.
+
v2.1.0 - any mutations in the 60 proximal promoter regions included in the WHO v2 database (Table 22, page 89-90). Use either the smw-tbprofiler-updates-dev branch until the time of the v2.3.0 release of TheiaProk on Terra for this and all subsequent v2.1.x+ versions
+
Earlier versions are now deprecated and will no longer be supported.
+
+
The following diagram shows how each version is related to the others without technical details:
@@ -1023,7 +1027,7 @@
The California Department of Public Health has clinically validated the following versions:
v1.2.2 for WGS, and
v1.4.4.8 for tNGS
-
Interpretation documents for v1.2.2 and v1.4.4.8 are available in the root directory of the tbp-parser repository; others are available in the interpretation_docs directory on GitHub.
+
+
Validate Before Use
+
CAUTION: The information produced by this program should not be used for clinical reporting unless and until extensive validation has occurred in your laboratory on a stable version. Otherwise, the outputs of tbp-parser are for research use only.