Skip to content

Impute variants from any fastq file - process for homozygous individuals

Quick Start

  1. Change keyfile1, param2, param3, and param4 in your config file to match file paths on your computer.
  2. Run phg findPaths [config.txt]

Details

This step uses the Minimap2 index and pangenome fasta files created in the previous step of the pipeline. It aligns a set of reads from an input fastq file to the graph and then uses an HMM to find the most likely path through the graph given the alignments. The database "paths" and "read_mapping_paths" tables are populated from the plugins called by this script.

Kitchen Sink

This step uses the pangenome fasta to find paths for new individuals. It requires a set of fastq reads. The type of sequencing data does not matter, but you should be working with fairly homozygous material. If you would like to run the findPaths step with heterozygous material, use the heterozygous findPaths step.

This pipeline has two main steps:

  1. Run FastqToMappingPlugin to map a set of reads to the pangenome fasta file. This plugin will make use of a keyfile and will store the mappings in the DB.

  2. Run BestHaplotypePathPlugin to take the mappings from the DB and use a HMM to find the optimal path through the DB. This will then store the paths into the DB.

The findPaths pipeline requires the use of 2 keyfiles. The second is autogenerated, but could be changed if different results are desired. Both files need to be tab-separated. If there are entries in the keyfile, but not found on the filesystem, the pipelines will skip over those entries. More information about these keyfiles can be found here.

There are 8 parameters used in this step:

  • BASE_HAPLOTYPE_NAME: This is the base of the name of the haplotype fasta file. This file should have been indexed in the previous step.
  • CONFIG_FILE_NAME: This is the path to the config.txt file used to create the DB. All the needed DB connection information should be in here.
  • HAPLOTYPE_METHOD: Method name of the haplotypes in the graph that were used to generate the haplotype fasta file in IndexPangenome.sh. This needs to match exactly otherwise the results will not be correct.
  • HAPLOTYPE_METHOD_FIND_PATH: This method can be the same as HAPLOTYPE_METHOD, but it is typically used to only include anchor reference ranges when running FindPath. Typically, finding paths through the interanchor reference ranges can cause additional errors. An example of what could be used here is this: HAPLOTYPE_METHOD,refRegionGroup to only include the refRegionGroup refRange group in the Graph used for finding the path.
  • HAPCOUNT_METHOD_NAME: Name of the Haplotype Mapping method used to upload the ReadMapping files to the DB. This is currently not used so a dummy value can be used. It will be implemented in the future.
  • PATH_METHOD_NAME: Name of the Path Mapping method used to upload the Paths to the DB. This method name will be used in the next step (phg exportPath) to extract out the paths from the DB.
  • READ_KEY_FILE: This name needs to match the name of the keyfile. This keyfile will describe what fastq files need to be aligned together and also denotes the required metadata fields which are stored in the DB.
  • PATH_KEY_FILE: This name is what the path finding keyfile will be named. FastqToMappingPlugin will create this file and then BestHaplotypePathPlugin will use it to find paths. Note that FastqToMappingPlugin will group reads by taxon and all the mappings for a given taxon will be used when finding the paths.

Details on running this step with wrapper scripts

When running this step on the command line, all file paths and parameters are set in the config file. The only call that needs to be run in the terminal is phg findPath /path/to/config.txt. If you would like to overwrite the parameters set in the config file, you can do that by setting the parameters on the command line directly.

For example, to ignore the config file HAPCOUNT_METHOD_NAME level and set one directly, you could run:

phg findPaths -configFile /path/to/config.txt -HAPCOUNT_METHOD_NAME MyNewMethod1

You can also run the FindPathMinimap2.sh bash script directly with the following run command:

#!bash

FindPathMinimap2.sh [BASE_HAPLOTYPE_NAME] [CONFIG_FILE_NAME] [HAPLOTYPE_METHOD] [HAPLOTYPE_METHOD_FIND_PATH] [HAPCOUNT_METHOD_NAME] [PATH_METHOD_NAME] [READ_KEY_FILE] [PATH_KEY_FILE]

Details on running this step through docker

When this step is run as part of a Docker container script, the Docker expect the following directory mount points:

  • Mount localMachine:/pathToInputs/FastQFiles/ to docker:/tempFileDir/data/fastq
  • Mount localMachine:/pathToOutputs/ to docker:/tempFileDir/outputDir/
  • Mount localMachine:/pathToPangenomeIndex/ to docker:/tempFileDir/outputDir/pangenome/
  • Mount localMachine:/pathToInputs/config.txt to docker:/tempFileDir/data/config.txt
  • Mount localMachine:/pathToInputs/phg.db to docker:/tempFileDir/outputDir/phgTestMaizeDB.db. This needs to match what is in the config file.

It is expected the database is stored in the User specified outputDir that is mounted below and the config.txt specifies the database name and login parameters.

It is critical that the .mmi file is mounted to /tempFileDir/outputDir/pangenome/ in the docker. Otherwise this will not work correctly.

An example Docker script to run the FindPathMinimap2.sh shell script is:

#!bash
DB=/workdir/user/DockerTuningTests/DockerOutput/phgTestMaizeDB.db
PANGENOME_DIR=/workdir/user/DockerTuningTests/DockerOutput/PangenomeFasta/  

docker run --name small_seq_test_container --rm \ 
    -w / \
    -v /workdir/user/DockerTuningTests/DockerOutput/:/tempFileDir/outputDir/ \
    -v /workdir/user/DockerTuningTests/InputFiles/GBSFastq/:/tempFileDir/data/fastq/ \
    -v /workdir/user/DockerTuningTests/InputFiles/config.txt:/tempFileDir/data/configSQLite.txt \
    -v ${DB}:/tempFileDir/outputDir/phgTestMaizeDB.db \
    -v ${PANGENOME_DIR}:/tempFileDir/outputDir/pangenome/ \
    -t maizegenetics/phg:latest \
    /FindPathMinimap2.sh phgSmallSeqSequence configSQLite.txt \
        CONSENSUS CONSENSUS,refRegionGroup \
        HAP_COUNT_METHOD PATH_METHOD \
        /tempFileDir/data/fastq/genotypingKeyFile.txt \
        /tempFileDir/data/fastq/genotypingKeyFile_pathKeyFile.txt

PANGENOME_DIR must match the directory set in the IndexPangenome step.

The --name parameter provides a name for the container. This is optional.

The --rm parameter indicates the container should be deleted when the program finishes executing. This is optional.

Note the -w / parameter. This is needed to guarantee that the script will run correctly. When running a normal docker, this is likely not needed, but if running on a system like cbsu, the working directory needs to be set to the root directory.

The -v directives are used to mount data from the user machine into the Docker. The path preceding the ":" is the path on the user machine. The directory path following the ":" are the paths inside the Docker where the user home directories will be mounted.

The -t directive indicates the Docker image of which this container will be an instance. The last line tells the Docker container to run the FindPath.sh script which is found in the root directory. The items following are the parameters to the FindPath.sh script.

Files

Config file

An example can be found here: Master config file

Pangenome fasta file (with index)

This is the pangenome fasta file created in the previous step. The name of that file should be the same as the value you set for the BASE_HAPLOTYPE_NAME parameter.

Read key file

Tab-delimited text file that sets up fastq alignment to the pangenome fasta. Has a header with "cultivar\tflowcell_lane\tfilename\tPlateID" and corresponding information for each fastq file used in the findPaths step. Only the first three columns are required. More information and an example are here: findPaths keyfiles

Path key file

This file is created automatically, so you will not need to create one on your own unless you would like to change some of the values in it. This file has three columns: "sampleName\tReadMappingIds\tLikelyParents". Only the first two columns are required. More information and an example are here: findPaths keyfiles

Plugins

FastqToMappingPlugin

The FastqToMappingPlugin executes code to count the number of reads which align to a haplotype node using Minimap 2.

If running in paired mode, it will align the read pairs using minimap and process the resulting SAM records. Only equally optimal mappings (by Edit Distance (NM) are kept. Additional filtering is done to remove reads which are unmapped or clipped.

If running in paired mode, the reads must be on opposite strands and both must currently hit the same haplotype in a single reference range. Optimal mappins across reference ranges are processed, but reads are only assigned to haplotypes in the reference range with the most hits (providing that at least 1-maxRefRangeError percentage hit that reference range). Based on testing, this seems to work the best to balance the number of reads used with accuracy. The plugin requires a PHG Haplotype Graph, which can be acquired by chaining the running of HaplotypeGraphBuilderPlugin and this plugin.

The output from the HaplotypeGraphBuilderPlugin will be input to the BestHaplotypePathPlugin.

The parameters for this plugin are:

  • -configFile DB Config File containing properties host,user,password,DB and DBtype where DBtype is either sqlite or postgres (Default=null) (REQUIRED) A sample config file can be found here:Master Config File.
  • -minimap2IndexFile : Name of the indexFile file to process (required)
  • -keyFile : Name of the Keyfile to process. Must have columns cultivar, flowcell_lane, filename, and PlateID. Optionally for paired end reads, filename2 is needed. If filename2 is not supplied, Minimap2 will run in single end mode. Otherwise will be paired. (required)
  • -fastqDir : Name of the Fastq dir to process. (required)
  • -maxRefRangeErr : Maximum allowed error when choosing best reference range to count. Error is computed 1 - (mostHitRefCount/totalHits) (Default: 0.25)
  • -lowMemMode : Run in low memory mode. (Default: true)
  • -maxSecondary : Maximum number of secondary alignments to be returned by minimap2. This will be the value of the -N parameter in the minimap2 command line. If the value is too low, some valid read mappings will not be reported. (Default: 20)
  • -minimapLocation : Location of Minimap2 on file system. This defaults to use minimap2 if it is on the PATH environment variable. (Default: minimap2)
  • -methodName : Method name to be stored in the DB. (required)
  • -methodDescription : Method description to be stored in the DB. (required)
  • -debugDir : Directory to write out the read mapping files. This is optional for debug purposes. (Default: )
  • -outputSecondaryStats : Ouptput Secondary Mapping Statistics such as total AS for each haplotype ID (Default: false)

The FastqToMappingPlugin requires a haplotype graph, which can be obtained with the HaplotypeGraphBuilderPlugin. An example for chaining these plugins is below:

perl /tassel-5-standalone/run_pipeline.pl -debug ${fullXmx} \
-configParameters ${CONFIGFILE} \
-HaplotypeGraphBuilderPlugin \
    -configFile ${CONFIGFILE} \
    -methods ${HAPLOTYPE_METHOD} \
    -includeVariantContexts false \
    -includeSequences false \
    -endPlugin \
-FastqToMappingPlugin \
    -minimap2IndexFile ${HAPLOTYPE_INDEX} \
    -keyFile ${KEY_FILE} \
    -fastqDir ${FASTQ_DIR}/ \
    -methodName ${HAP_COUNT_METHOD} \
    -methodDescription READ_MAPPING_DESCRIPTION \
    -debugDir $OUTPUT_DIR \
    -endPlugin

BestHaplotypePathPlugin

This plugin takes a haplotype graph and a set of read mappings to infer the best (most likely) path through the graph given the read mappings. Read mappings are a list of reads with a set of haplotypes to which that read aligned.

The plugin can (1) take a file of read mappings and return a file with a list of haplotypes or (2) take read mappings from a PHG DB and store the resulting list of haplotypes in the DB.

If (1), the input is a file, then the plugin can take either a file or a directory containing multiple files. If a directory, all read mapping files will be processed and the haplotype lists output as separate files to an output directory. If the output directory is not specified, then the lists will be written to the input directory. Any path files of the same name will not be overwritten; a message will be written to the log to that effect, unless the "overwrite" flag is set to true.

If (2), the input comes from a PHG DB, an input read map method and the output path method must be supplied. In addition, a specific taxon or list of taxa for which paths are to be imputed can be suppled. If paths for any of the taxa and methods exist, the paths will not be imputed and a warning message will be written to the log file. If the "overwrite" flag is set to true, any existing paths will be overwritten and a message to that effect will be written to the log.

The parameters for this plugin are:

  • -keyFile : KeyFile file name. Must be a tab separated file using the following headers: SampleName ReadMappingIds LikelyParents
    • ReadMappingIds and LikelyParents need to be comma separated for multiple values (required)
  • -readFile : Filename of read mappings. Do not supply both a filename and a directory.
  • -readDir : Directory of read mapping files. If this is supplied, do not also assign a read filename.
  • -outDir : Directory to which path files will be written.
  • -readMethod : The name of the read mapping method in the PHG DB (required)
  • -pathMethod : The name of the path method used to write the results to the PHG DB (required)
  • -overwrite : If an output pathfile already exists for a taxon, then it will be overwritten if overwrite = true. Otherwise, it will not and a warning will be written to the log. Likewise for paths in the PHG DB. (Default: false)
  • -minTaxa : minimum number of taxa per anchor reference range. Ranges with fewer taxa will not be included in the output node list. (Default: 20)
  • -minReads : minimum number of reads per anchor reference range. Ranges with fewer reads will not be included in the output node list. (Default: 1)
  • -maxReads : maximum number of include counts per anchor reference range Kb. Ranges with more reads will not be included in the output node list. (Default: 10000)
  • -maxNodes : maximum number of nodes per reference range. Ranges with more nodes will not be included in the output node list. (Default: 1000)
  • -minTransitionProb : minimum probability of a transition between nodes at adjacent reference ranges. (Default: 0.001)
  • -probCorrect : minimum number of reads per anchor reference range. Ranges with fewer reads will not be included in the output node list. (Default: 0.99)
  • -splitNodes : split consensus nodes into one node per taxon. (Default: true)
  • -splitProb : When the consensus nodes are split by taxa, this is the transition probability for moving from a node to the next node of the same taxon. It equals 1 minus the probability that the path will switch between taxa. (Default: 0.99)
  • -usebf : Use the Backward-Forward algorithm instead of the Viterbi algorithm for the HMM. (Default: false)
  • -minP : Only nodes with minP or greater probability will be kept in the path when using the Backward-Forward algorithm, (Default: 0.8)
  • -bfInfoFile : The base name of the file to node probabilities from the backward-forward algorithm will be written. taxonName.txt will be appended to each file.
  • -removeEqual : Ranges with equal read counts for all haplotypes should be removed from the graph. Defaults to true but will be always be false if minReads = 0. (Default: true)
  • -numThreads : Number of threads used to find paths. The path finding will subtract 2 from this number to have the number of worker threads. It leaves 1 thread for IO to the DB and 1 thread for the Operating System. (Default: 3)
  • -requiredTaxa : Optional list of taxa required to have haplotypes. Any reference range that does not have a haplotype for one of these taxa will not be used for path finding. This can be a comma separated list of taxa (no spaces unless surrounded by quotes), file (.txt) with list of taxa names to include, or a taxa list file (.json or .json.gz). By default, all taxa will be included.
  • -algorithmType : the type of algorithm. Choices are classic, which is the original implementation described by Rabiner 1989, or efficient, which is modified for improved computational efficiency. [classic, efficient] (Default: efficient)

The BestHaplotypePathPlugin requires a haplotype graph, which can be obtained with the HaplotypeGraphBuilderPlugin. An example for chaining these plugins is below:

/tassel-5-standalone/run_pipeline.pl -debug ${fullXmx} \
-configParameters ${CONFIGFILE} \
-HaplotypeGraphBuilderPlugin \
    -configFile ${CONFIGFILE} \
    -methods ${HAPLOTYPE_METHOD_FIND_PATH} \
    -includeVariantContexts false \
    -includeSequences false \
    -endPlugin \
-BestHaplotypePathPlugin \
    -keyFile ${PATH_KEY_FILE} \
    -outDir ${HAP_COUNT_BEST_PATH_DIR} \
    -readMethod ${HAP_COUNT_METHOD} \
    -pathMethod ${PATH_METHOD} \
    -endPlugin

Troubleshooting

  1. If you see this error: ERROR net.maizegenetics.plugindef.AbstractPlugin - Haplotype count methodid not found in db for method : HAP_COUNT_METHOD, it means that something went wrong during the ReadMapping step. Double check the -v parameters and make sure the mmi file is in /tempFileDir/outputDir/pangenome/

Return to Step 3 pipeline

Return to Wiki Home