RequiredAnnotations

The Steering Committee-approved Minimum Data Standards can be found here.

Revised 9/20/2012


Data submissions are divided into two separate categories: Experiments and Analysis.

Create a Project and Experiment to enter raw data. Projects are collections of related Experiments.
  • raw sequencing data without barcodes in Illumina fastq.gz, SOLiD formats, or unfiltered BAM format.
  • quality control files from the sample and library prep (e.g. bioanalyzer, gel picture, ...)
  • quality control output from fastQC (http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/) or SOLiD LifeScope
Create an Analysis for anything derived from raw data
  • alignments
  • differential gene expression lists
  • variant calls, peak calls


Data file requirements

  • Illumina data need to be concatenated into one fastq.gz file per lane; do not submit per-tile fastq files. Whatever file format you use, it should contain all reads and their quality scores.
  • Alignment Analysis:
  1. coordinate-sorted, filtered BAM alignment file(s) with index, produced with Picard's SortSam, NOT SAMTools (the latter doesn't change the sorted flag)
  2. relative, per base, read coverage graph in bigWig or useq format (e.g. http://useq.sourceforge.net/cmdLnMenus.html#Sam2USeq and http://useq.sourceforge.net/cmdLnMenus.html#USeq2UCSCBig)
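
The two file requirements above can be checked before upload. The following is a minimal sketch, not part of the standard: the function names `concatenate_lane` and `bam_sort_order` are hypothetical. It relies on two facts: gzip files joined byte for byte form a valid multi-member gzip file, and a BAM file is gzip-compatible (BGZF), so its header text, including the `SO:` sort-order tag on the `@HD` line, can be read with Python's standard `gzip` module.

```python
import gzip
import shutil
import struct


def concatenate_lane(tile_files, lane_file):
    """Join per-tile fastq.gz files into one per-lane fastq.gz file.

    Gzip members are appended byte for byte, so no recompression is needed.
    """
    with open(lane_file, "wb") as out:
        for tile in tile_files:
            with open(tile, "rb") as part:
                shutil.copyfileobj(part, out)


def bam_sort_order(bam_path):
    """Return the SO value from the @HD header line of a BAM file, or None."""
    with gzip.open(bam_path, "rb") as bam:
        if bam.read(4) != b"BAM\x01":
            raise ValueError(f"{bam_path} is not a BAM file")
        # l_text: little-endian int32 length of the plain-text SAM header
        (text_length,) = struct.unpack("<i", bam.read(4))
        header = bam.read(text_length).decode("utf-8", errors="replace")
    for line in header.rstrip("\x00").splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None
```

A submission that meets the alignment requirement should give `bam_sort_order(path) == "coordinate"`; any other value, or `None`, suggests the sorted flag was not set.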


Meta data information and annotation requirements

A. For each Project,
  • Select lab from dropdown menu
  • Project Name (Must be self-explanatory to guest users!)
  • Project Description (e.g. purpose of the study or hypothesis, experimental approach, results)
B. For each Experiment, (Items marked * are required where appropriate.)
  • Lab or Core Name
  • Name of Submitter
  • Project folder
  • Organism
  • Experiment platform (Illumina HiSeq2000, Ion Torrent, SOLiD, ...)
  • Experiment type (ChIP-Seq, RNA-Seq, Bis-Seq, Exon, Whole Genome, ...)
  • Experiment Name
  • Experiment Description (describes aspects of the experiment that distinguish it from other experiments within the project)
  • Sample source (Where did the sample come from? Cell Line, Whole Organism, Tissue, FACS Sorted Cells, ...)
  • Sample type (What was actually processed? DNA, chIP DNA, FFPE DNA, polyA RNA, cap RNA, small RNA, ...)
  • Sequencing Facility
  • Sequencing Library Prep Method
  • Sequencing Read Type (Single End, Paired End, Circularized Mate Paired End)
  • Directional Sequencing Library (Is the library stranded? Yes or No)
  • Sequencing Read Length (36, 50, 101... for Illumina and IonTorrent platforms)
  • Replica Number (not the total, but the number of the individual sample: 1st, 2nd, 3rd, 4th, ...)
  • Stage
  • Cell line (for cell cultures)*
  • Tissue (e.g. atrium, ventricle, valve)*
  • Organ (e.g. heart, kidney)*
  • (For ChIP-Seq) Antibody antigen, vendor, catalog #, and lot #*
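
The Experiment annotation list above can be checked mechanically before submission. The sketch below is illustrative only: the dictionary keys are hypothetical names chosen for this example, not field names used by the submission system. It also assumes that the antibody items apply when the experiment type is "ChIP-Seq" and that the cell line applies when the sample source is "Cell Line". Tissue and Organ are left unchecked because "where appropriate" cannot be decided automatically.

```python
# Hypothetical keys, one per unstarred Experiment item in the list above.
REQUIRED_EXPERIMENT_FIELDS = [
    "lab_or_core_name", "submitter_name", "project_folder", "organism",
    "platform", "experiment_type", "experiment_name",
    "experiment_description", "sample_source", "sample_type",
    "sequencing_facility", "library_prep_method", "read_type",
    "directional_library", "read_length", "replica_number", "stage",
]

# Starred items assumed to apply only to ChIP-Seq experiments.
CHIP_SEQ_FIELDS = [
    "antibody_antigen", "antibody_vendor",
    "antibody_catalog_number", "antibody_lot_number",
]


def missing_experiment_fields(record):
    """Return the names of annotation fields that are absent or blank."""

    def is_blank(key):
        value = record.get(key)
        return value is None or str(value).strip() == ""

    missing = [key for key in REQUIRED_EXPERIMENT_FIELDS if is_blank(key)]
    if record.get("experiment_type") == "ChIP-Seq":
        missing += [key for key in CHIP_SEQ_FIELDS if is_blank(key)]
    if record.get("sample_source") == "Cell Line" and is_blank("cell_line"):
        missing.append("cell_line")
    return missing
```

An empty list means every item covered by the sketch is filled in; otherwise the returned names show what still needs to be entered.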


C. For Alignment Analysis,
  • aligner and version (e.g. novoalign V2.07.15b)
  • alignment command line with all parameters (e.g. novoalign -o SAM -F ILMFQ -t 150 -r A 50 -d hg19EnsTransRad46Num100kMin10SplicesChrPhiXAdaptr.nov.illumina.nix -f 8057X1_110428_SN141_0336_BC00JPABXX_8.txt.gz | grep -v ^@SQ | grep chr | gzip > 8057X1_110428_SN141_0336_BC00JPABXX_8.sam.gz )
  • genome build (e.g. mm9, hg19, zv9)
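
One way to keep the three Alignment Analysis items together is to assemble them into a single annotation text at the time the aligner is run. This is a minimal sketch under stated assumptions: the function name `alignment_annotation` and the output layout are hypothetical, and nothing here is prescribed by the standard.

```python
def alignment_annotation(aligner, version, command_line, genome_build):
    """Format the required Alignment Analysis items, rejecting blank values."""
    items = {
        "aligner": aligner,
        "version": version,
        "command line": command_line,
        "genome build": genome_build,
    }
    blank = [name for name, value in items.items() if not value.strip()]
    if blank:
        raise ValueError("missing alignment annotation: " + ", ".join(blank))
    return "\n".join(f"{name}: {value.strip()}" for name, value in items.items())
```

The command line should be recorded exactly as it was run, including pipes and redirections, so that the alignment can be reproduced.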


Highly Recommended Information

  1. Information about analysis tool and parameters for all Analysis:
    • analysis applications with versions (e.g. USeq_8.1.5)
    • analysis application command lines with all parameters (e.g. java -Xmx2G -jar USeq_8.1.5/Apps/ChIPSeq -m -i 20 -u -f satelliteRepeatsMm9.bed.gz -y sam -v M_musculus_Jul_2007 -u -e -s H3K18Ac_MB_vs_MNAse_MB -t H3K18Ac_MB/ -c MNAse_digested_input_MB/ &> log_H3K18Ac_MB_vs_MNAse_MB.txt &)
    • annotation source and version (e.g. ensembl 66)
  2. Files associated with analysis:
    A. Additional ChIP-Seq Analysis data files:
    • pairwise differential log2Ratio chIP enrichment/ reduction graph in bigWig or useq format
    • pairwise differential significance (p-value or FDR) enrichment/ reduction graph in bigWig or useq format
    • peak calls in bed.gz format
    • spreadsheet summary file for peak calls
    B. Additional RNA-Seq Analysis data files:
    • for known genes, a spreadsheet summary file with differential expression statistics
    • for novel transfrags, a pairwise differential log2Ratio enrichment/ reduction graph in bigWig or useq format
    • for novel transfrags, a pairwise differential significance (p-value or FDR) enrichment/ reduction graph in bigWig or useq format
    C. Additional Bis-Seq Analysis data files:
    • per base fraction mC methylation graph in bigWig or useq format
    • pairwise differential bis-seq 1 vs bis-seq 2 window level fraction mC methylation graphs in bigWig or useq format
    • pairwise differential methylated region calls at various thresholds in *.bed.gz format
    • spreadsheet summary file for differentially methylated regions
    D. For exome and whole genome Variant Analysis:
    • variant calls in vcf format
    • spreadsheet summary file of candidate affected genes and gene pathways