QC Standard Analysis Pipeline

From B2BWiki

Jump to: navigation, search

Contents

Standard Analysis of CvDC Sequence Data

Source of data

The Gnomex database on the B2B server is queried to obtain metadata such as the raw data file (fastq) paths, the species, the sequencing applications (RNA-seq, ChIPSeq, Exome sequencing, etc.) for each sample.

Raw data QC

Fastqc is run on each raw sample file.
$ fastqc -outdir=<path> <input fastq file>

Pre-alignment QC transformations

All samples are quality and adapter trimmed by ea-utils' FastqMcf
For single read samples:
$ fastq-mcf -S -o <output path for read1> -o <output path for read2> <contaminants fasta> <read1> <read2>
For paired end samples:
$ fastq-mcf -S -o <output path> <contaminants fasta> <read1>

Alignment

RNA-seq
Trimmed reads are aligned with Tophat to the transcriptome and the genome. Paired end reads that map with out the mate pair matching are retained.
For single read samples:
$ tophat -o <path to output> --min-anchor=5 --segment-length=25 --no-coverage-search --segment-mismatches=2 --splice-mismatches=2 --microexon-search --GTF=<path to gtf> --num-threads=12 --rg-sample <sampleID> --rg-id <sampleId:Lane> <bowtie index path> <path to trimmed read1>
For paired end samples:
$ tophat -o <path to output> --min-anchor=5 --segment-length=25 --no-coverage-search --segment-mismatches=2 --splice-mismatches=2 --microexon-search --GTF=<path to gtf> --num-threads=12 --mate-inner-dist=150 --rg-sample <sampleID> --rg-id <sampleId:Lane> <bowtie index path> <path to trimmed read1> <path to trimmed read2>
ChipSeq
Under Construction
Exome Sequencing
Under Construction

Duplicate detection

RNA-seq
Picard tool's MarkDuplicates is used to mark the aligned reads that are PCR or optical duplicates.
$ java -Xmx8g -jar <MarkDuplicates path>INPUT=<input bam file> REMOVE_DUPLICATES=FALSE CREATE_INDEX=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=204 OPTICAL_DUPLICATE_PIXEL_DISTANCE=10 METRICS_FILE=<path to metrics file> OUTPUT=<output Path>
Note: the numerical command line parameters are system specific.
ChipSeq
Under Construction
Exome Sequencing
Under Construction

PostAlignment QC

RNA-seq
RseQC's modules are used to quality-check the bam after duplicates have been marked. Specifically:
  • geneBody_coverage.py - Measures the distribution of the reads across the body of the gene showing features such as uniformity of distribution across the gene as well as 5' and 3' biases.
$ geneBody_coverage.py -i <input bam> -r <refgene gene table> -o <out path base name>
  • inner_distance.py (for paired end experiments only) - Measures the distribution of inner-mate pair distance across the experiment.
$ inner_distance.py -ii <input bam> -r <refgene gene table> -u 700 -o <out path base name>
  • read_duplication.py - Measures the read duplication rates across the sample.
$ read_duplication.py -i <input bam> -o <output path base name>
  • RPKM_saturation.py - Does multiple subsampling of the data, calculating RPKM for the transcripts at each subsampling level. This can be used to get a sense of whether the current depth of sequencing was sufficiently deep.
$ RPKM_saturation.py -i <input bam> -r <refgene gene table> -o <output path base name>
  • junction_saturation.py - Similar to the RPKM saturation, this subsamples the data, counting unique splice junctions at each level of sub sampling. All junctions should be observable in a saturated data set, so this can be used to determine whether the dataset is sufficiently saturated for differential splice analysis.
$ junction_saturation.py -i <input bam> -r <refgene gene table> -o <outpath base name>
  • read_distribution.py - Counts the bases and reads in various regions relative to genes such as in CDS exons, 5’UTR Exons, 3’UTR Exons, Introns, TSS up 1kb, TSS up 5kb, TSS up 10kb, TES down 1kb, TES down 5kb, and TES down 10kb.
$ read_distribution.py -i <input bam> -r <refgene gene table> > <outpath>
ChipSeq
Under Construction
Exome Sequencing
Under Construction

Example Output Folder and File Structure for a Sample

├── fastqc                                                                                                                                                                                                                                                          [50/21038]
│   ├── 100811_JWH9B_jwamstad_Boyer_L7_1_sequence_fastqc
│   │   ├── fastqc_data.txt
│   │   ├── fastqc_report.html
│   │   ├── Icons
│   │   │   ├── error.png
│   │   │   ├── fastqc_icon.png
│   │   │   ├── tick.png
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   ├── duplication_levels.png
│   │   │   ├── kmer_profiles.png
│   │   │   ├── per_base_gc_content.png
│   │   │   ├── per_base_n_content.png
│   │   │   ├── per_base_quality.png
│   │   │   ├── per_base_sequence_content.png
│   │   │   ├── per_sequence_gc_content.png
│   │   │   ├── per_sequence_quality.png
│   │   │   └── sequence_length_distribution.png
│   │   └── summary.txt
│   ├── 100811_JWH9B_jwamstad_Boyer_L7_1_sequence_fastqc.zip
│   ├── 100811_JWH9B_jwamstad_Boyer_L7_2_sequence_fastqc
│   │   ├── fastqc_data.txt
│   │   ├── fastqc_report.html
│   │   ├── Icons
│   │   │   ├── error.png
│   │   │   ├── fastqc_icon.png
│   │   │   ├── tick.png
│   │   │   └── warning.png
│   │   ├── Images
│   │   │   ├── duplication_levels.png
│   │   │   ├── kmer_profiles.png
│   │   │   ├── per_base_gc_content.png
│   │   │   ├── per_base_n_content.png
│   │   │   ├── per_base_quality.png
│   │   │   ├── per_base_sequence_content.png
│   │   │   ├── per_sequence_gc_content.png
│   │   │   ├── per_sequence_quality.png
│   │   │   └── sequence_length_distribution.png
│   │   └── summary.txt
│   └── 100811_JWH9B_jwamstad_Boyer_L7_2_sequence_fastqc.zip
├── Tophat_B2B7R--7X1
│   ├── accepted_hits.bam
│   ├── accepted_hits_mark_dupe.bai
│   ├── accepted_hits_mark_dupe.bam
│   ├── accepted_hits_mark_dupe.bam_mark_dupe_metrics.txt
│   ├── accepted_hits_mark_dupe_qc
│   │   ├── gene_body_coverage.geneBodyCoverage.pdf
│   │   ├── gene_body_coverage.geneBodyCoverage_plot.r
│   │   ├── gene_body_coverage.geneBodyCoverage.txt
│   │   ├── inner_distance.inner_distance_freq.txt
│   │   ├── inner_distance.inner_distance_plot.pdf
│   │   ├── inner_distance.inner_distance_plot.r
│   │   ├── inner_distance.inner_distance.txt
│   │   ├── junction_saturation.junctionSaturation_plot.pdf
│   │   ├── junction_saturation.junctionSaturation_plot.r
│   │   ├── read_distribution.txt
│   │   ├── read_duplication.DupRate_plot.pdf
│   │   ├── read_duplication.DupRate_plot.r
│   │   ├── read_duplication.pos.DupRate.xls
│   │   ├── read_duplication.seq.DupRate.xls
│   │   ├── RPKM_saturation.eRPKM.xls
│   │   ├── RPKM_saturation.rawCount.xls
│   │   ├── RPKM_saturation.saturation.pdf
│   │   └── RPKM_saturation.saturation.r
│   ├── deletions.bed
│   ├── insertions.bed
│   ├── junctions.bed
│   ├── logs
│   │   ├── bam_merge_um.log
│   │   ├── bowtie_build.log
│   │   ├── bowtie.left_kept_reads.log
│   │   ├── bowtie.left_kept_reads.m2g_um.log
│   │   ├── bowtie.left_kept_reads.m2g_um_unmapped.log
│   │   ├── bowtie.right_kept_reads.log
│   │   ├── bowtie.right_kept_reads.m2g_um.log
│   │   ├── bowtie.right_kept_reads.m2g_um_unmapped.log
│   │   ├── g2f.err
│   │   ├── g2f.out
│   │   ├── gtf_juncs.log
│   │   ├── juncs_db.log
│   │   ├── long_spanning_reads.segs.log
│   │   ├── m2g_left_kept_reads.err
│   │   ├── m2g_left_kept_reads.out
│   │   ├── m2g_right_kept_reads.err
│   │   ├── m2g_right_kept_reads.out
│   │   ├── prep_reads.log
│   │   ├── reports.log
│   │   ├── reports.merge_bam.log
│   │   ├── reports.samtools_sort.log0
│   │   ├── reports.samtools_sort.log1
│   │   ├── reports.samtools_sort.log10
│   │   ├── reports.samtools_sort.log11
│   │   ├── reports.samtools_sort.log2
│   │   ├── reports.samtools_sort.log3
│   │   ├── reports.samtools_sort.log4
│   │   ├── reports.samtools_sort.log5
│   │   ├── reports.samtools_sort.log6
│   │   ├── reports.samtools_sort.log7
│   │   ├── reports.samtools_sort.log8
│   │   ├── reports.samtools_sort.log9
│   │   ├── run.log
│   │   ├── segment_juncs.log
│   │   └── tophat.log
│   ├── prep_reads.info
│   ├── RunLog-Tophat_B2B7R--7X1.log
│   └── unmapped.bam
└── trimmed
    ├── 100811_JWH9B_jwamstad_Boyer_L7_1_sequence.txt.gz.MCF.gz
    ├── 100811_JWH9B_jwamstad_Boyer_L7_1_sequence.txt.gz.MCF.skip.gz
    ├── 100811_JWH9B_jwamstad_Boyer_L7_2_sequence.txt.gz.MCF.gz
    ├── 100811_JWH9B_jwamstad_Boyer_L7_2_sequence.txt.gz.MCF.skip.gz
    └── mcfrun.log

Flow Chart for Data Processing and Visualization

B2BQC DATA FLOW.jpg

Personal tools