Create and Populate a PHG database with haplotypes
Before creating the db, you need to setup the PHG directory structure. We recommend using the default directory structure, which can be created as described in step 1A below. When the default directory is created, two of the files created are config.txt and load_genome_data.txt. Those files need to have values added or modified by the user before other steps can be run.
A sample config file can be viewed here
There are two steps to building and populating a PHG database. The first step creates the PHG database and populates it with the reference genome. The second step populates the database with additional haplotypes that have been aligned to the reference genome reference intervals.
Step 1: Create database and load the reference genome and genome intervals
This part of the pipeline populates the database with the reference genome, broken into "reference ranges" based on user-defined genome intervals. These same reference ranges are used to create haplotype blocks in later steps. The user can define different groups of reference ranges for different analyses.
Run Step 1
Use the links below to work through each step in the Step 1 flow chart
A. Create the default directory structure
C. Create bed file to define genome intervals
D. Optionally, set up additional groups of PHG intervals
Step 1D can be done at any time after the initial genome intervals have been loaded.
Step 2: Add haplotypes to a database
To run the Step 2 pipeline with default values
After replacing "/path/to" with the correct path on your computer and my_container_name with a meaningful name,
run docker run --name small_example_container --rm -v /path/to/dockerBaseDir/:/phg/ -t maizegenetics/phg /tassel-5-standalone/run_pipeline.pl -Xmx2G -debug -configParameters /phg/configSQLiteDocker.txt -PopulatePHGDBPipelinePlugin -endPlugin
For information on setting config parameters, see details for running PopulatePHGDBPipelinePlugin.
Individual parts of the step 2 pipeline can be run separately
A. Align assemblies to reference genome, add to DB, use one of these options. The alignment with mummer4 will take less time to run and is only available with PHG versions 0.0.40 and older. Anchorwave alignment can be very slow. But the quality of alignment when using anchorwave has been shown to be superior to other methods. Please check out this paper for more information on the anchorwave alignment process:
B. Align WGS fastq files to reference genome
C. Call variants from BAM file
D. Filter GVCF, add to database
Important: A GVCF file contains more information than a regular VCF file. We use the GATK GVCF format for the PHG.
E. Create consensus haplotypes
F. Optionally, set up groups of PHG taxa
Return to PHG version 0.0.40 Home