.. _mirna_annotation: *************** miRNA annotation *************** miRNA annotation is running inside `bcbio small RNAseq pipeline `_ together with other tools to do a complete small RNA analysis. For some comparison with other tools go `here `_. You can run samples after processing the reads as shown below. Currently there are two version: JAVA **Naming** See always up to date information here in `mirtop open project `_. It is a working process, but since 10-21-2015 isomiR naming has changed to: * Nucleotide substitution: ``NUMBER|NUCLEOTIDE_ISOMIR|NUCLEOTIDE_REFERENCE`` means at the position giving by the number the nucleotide in the sequence has substituted the nucleotide in the reference. This, as well, is a post-transcriptional modification. * Additions at 3' end: ``0/NA`` means no modification. ``UPPER CASE LETTER`` means addition at the end. Note these nucleotides don't match the precursor. So they are post-transcriptional modification. * Changes at 5' end: ``0/NA`` means no modification. ``UPPER CASE LETTER`` means nucleotide insertions (sequence starts before miRBase mature position). ``LOWWER CASE LETTER`` means nucleotide deletions (sequence starts after miRBase mature position). * Changes at 3' end: ``0/NA`` means no modification. ``UPPER CASE LETTER`` means nucleotide insertions (sequence ends after miRBase mature position). ``LOWWER CASE LETTER`` means nucleotide deletions (sequence ends before miRBase mature position). Processing of reads ------------------- **REMOVE ADAPTER** I am currently using ``cutadapt``. :: cutadapt --adapter=$ADAPTER --minimum-length=8 --untrimmed-output=sample1_notfound.fastq -o sample1_clean.fastq -m 17 --overlap=8 sample1.fastq **COLLAPSE READS** To reduce computational time, I recommend to collapse sequences, also it would help to apply filters based on abundances. Like removing sequences that appear only once. :: seqcluster collapse -f sample1_clean.fastq -o collapse Here I am only using sequences that had the adapter, meaning that for sure are small fragments. The output will be named as ``sample1_clean_trimmed.fastq`` Prepare databases ----------------- For human or mouse, follows `this instruction `_ to download easily miRBase files. In general you only need hairpin.fa and miRNA.str from miRBase site. `mirGeneDB `_ is also supported, download the needed files `here `_. **Highly recommended to filter hairpin.fa to contain only the desired species.** miRNA/isomiR annotation with JAVA --------------------------------- **MIRALIGNER** Download the tool from `miraligner`_ repository. .. _miraligner: https://github.com/lpantano/seqbuster/raw/miraligner/modules/miraligner/miraligner.jar Download the mirbase files (`hairpin`_ and `miRNA`_) from the ftp and save it to `DB` folder. .. _hairpin: ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.zip .. _miRNA: ftp://mirbase.org/pub/mirbase/CURRENT/miRNA.str.zip You can map the miRNAs with. :: java -jar miraligner.jar -sub 1 -trim 3 -add 3 -s hsa -i sample1_clean_trimmed.fastq -db DB -o output_prefix **Cite** SeqBuster is a bioinformatic tool for the processing and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications in human embryonic cells. Pantano L, Estivill X, Martí E. *Nucleic Acids Res. 2010 Mar;38(5):e34. Epub 2009 Dec 11.* **NOTE:** `Check comparison of multiple tools `_ for miRNA annotation. Convert to GFF3-srna --------------------- Use mirtop to convert to GFF3-srna format. This is the desired format to share the isomiR information and can be used to join multiple projects together easily. See _`this ` to know how to convert all the output into a single file and share easily with collaborators:: mirtop gff --format seqbuster --sps hsa --hairpin database/hairpin.fa --gtf database/hsa.gff3 -o test_out out_folder/*/*.mirna .. _miRNAgff: ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff3 Post-analysis with R -------------------- Use the outputs to do differential expression, clustering and descriptive analysis with this package: `isomiRs `_ To load the data you can use `IsomirDataSeqFromFiles function `_ and get the count data with `isoCounts `_ to move to DESeq2 or similar packages. Manual of miraligner(JAVA) -------------------------- **options** Add ``-freq`` if you have your fasta/fastq file with this format and you want a third column with the frequency (normally value after x character):: >seq_1_x4 CACCGCTGTCGGGGAACCGCGCCAATTT Add ``-pre`` if you want also sequences that map to the precursor but outside the mature miRNA * Parameter `-sub`: mismatches allowed (0/1) * Parameter `-trim`: nucleotides allowed for trimming (max 3) * Parameter `-add`: nucleotides allowed for addition (max 3) * Parameter `-s`: species (3 letter, human=>hsa) * Parameter `-i`: fasta file * Parameter `-db`: folder where miRBase files are(one copy at miraligner-1.0/DB folder) * Parameter `-o`: prefix for the output files * Parameter `-freq`: add frequency of the sequence to the output (just where input is fasta file with name matching this patter: >seq_3_x67) * Parameter `-pre`: add sequences mapping to precursors as well **input** A fasta/fastq file reads:: >seq CACCGCTGTCGGGGAACCGCGCCAATTT or tabular file with counts information:: CACCGCTGTCGGGGAACCGCGCCAATTT 45 **output** Track file *.mirna.opt: information about the process Non mapped sequences will be on *.nomap Header of the *.mirna.out file: * seq: sequence * freq/name: depending on the input this column contains counts (tabular input file) or name (fasta file) * mir: miRNA name * start: start of the sequence at the precursor * end: end of the sequence at the precursor * mism: nucleotide substitution position | nucleotide at sequence | nucleotide at precursor * addition: nucleotides at 3 end added:: precursor => cctgtggttagctggttgcatatcc annotated miRNA => TGTGGTTAGCTGGTTGCATAT sequence add: TT => TGTGGTTAGCTGGTTGCATATTT * tr5: nucleotides at 5 end different from the annonated sequence in miRBase:: precursor => cctgtggttagctggttgcatatcc annotated miRNA => TGTGGTTAGCTGGTTGCATAT sequence tr5: CC => CCTGTGGTTAGCTGGTTGCATAT sequence tr5: tg => TGGTTAGCTGGTTGCATAT * tr3: nucleotides at 3 end different from the annotated sequence in miRBase:: precursor => cctgtggttagctggttgcatatcc annotated miRNA => TGTGGTTAGCTGGTTGCATAT sequence tr3: cc => TGTGGTTAGCTGGTTGCATATCC sequence tr3: AT => TGTGGTTAGCTGGTTGCAT * s5: offset nucleotides at the begining of the annotated miRNAs:: precursor => agcctgtggttagctggttgcatatcc annotated miRNA => TGTGGTTAGCTGGTTGCATAT s5 => AGCCTGTG * s3:offset nucleotides at the ending of the annotated miRNAs:: precursor => cctgtggttagctggttgcatatccgc annotated miRNA => TGTGGTTAGCTGGTTGCATAT s3 => ATATCCGC * type: mapped on precursor or miRNA sequences * ambiguity: number of different detected precursors Example:: seq miRNA start end mism tr5 tr3 add s5 s3 DB amb TGGCTCAGTTCAGCAGGACC hsa-mir-24-2 50 67 0 qCC 0 0 0 0 precursor 1 ACTGCCCTAAGTGCTCCTTCTG hsa-miR-18a* 47 68 0 0 0 tG ATCTACTG CTGGCA miRNA 1