Getting started

Best practices are implemented in a python framework.

clustering of small RNA sequences

seqcluster generates a list of clusters of small RNA sequences, their genome location, their annotation and the abundance in all the sample of the project



I am currently using cutadapt:

cutadapt --adapter=$ADAPTER --minimum-length=8 --untrimmed-output=sample1_notfound.fastq -o sample1_clean.fastq -m 17 --overlap=8 sample1.fastq


To reduce computational time, I recommend to collapse sequences, also it would help to apply filters based on abundances. Like removing sequences that appear only once.

seqcluster collapse -f sample1_clean.fastq -o collapse

Here I am only using sequences that had the adapter, meaning that for sure are small fragments.


seqcluster prepare -c file_w_samples -o res --minl 17 --minc 2 --maxl 45

the file_w_samples should have the following format:

lane1_sequence.txt_1_1_phred.fastq      cc1
lane1_sequence.txt_2_1_phred.fastq      cc2
lane2_sequence.txt_1_1_phred.fastq      cc3
lane2_sequence.txt_2_1_phred.fastq      cc4

two columns file, where the first column is the name of the file with the small RNA sequences for each sample, and the second column in the name of the sample.

The fastq files should be like this:


Where _x[09] indicate the abundance of that sequence, and the middle number is the index of the sequence.

This script will generate: seqs.fastq and * seqs.fastq: have unique sequences and unique ids * is the abundance matrix of all unique sequences in all samples


You should use an aligner to map seqs.fa to your genome. A possibility is bowtie or STAR. From here, we need a file in BAM format for the next step. VERY IMPORTANT: the BAM file should be sorted

bowtie -a --best --strata -m 5000 INDEX seqs.fastq -S | samtools view -Sbh /dev/stdin | samtools sort -o /dev/stdout temp > seqs.sort.bam


STAR --genomeDir $star_index_folder --readFilesIn res/seqs.fastq --alignIntronMax 1  --outFilterMultimapNmax 1000 --outSAMattributes NH HI NM --outSAMtype BAM SortedByCoordinate


seqcluster cluster -a res/Aligned.sortedByCoord.out.bam  -m res/ -g $GTF_FILE  -o res/cluster -ref PATH_TO_GENOME_FASTA --db example
  • -a is the SAM file generated after mapped with your tool, which input has been seqs.fa
  • -m the previous seqs.fa
  • -b annotation files in bed format (see below examples) [deprecated]
  • -g annotation files in gtf format (see below examples) [recommended]
  • -i genome fasta file used in the mapping step (only needed if -s active)
  • -o output folder
  • -ref genome fasta file. Needs fai file as well there. (i.e hg19.fa, hg19.fa.fai)
  • -d create debug logging
  • -s construction of putative precursor (NOT YET IMPLEMENTED)
  • –db (optional) will create sqlite3 database with results that will be used to browse data with html web page (under development)

Example of a bed file for annotation (the fourth column should be the name of the feature):

chr1    157783  157886  snRNA   0       -

Strongly recommend gtf format. Bed annotation is deprecated. Go here to know how to download data from hg19 and mm10.

Example of a gtf file for annotation (the third column should be the name of the feature and the value after gene name attribute is the specific annotation):

chr1    source  miRNA      1       11503   .       +       .       gene name 'mir-102' ;

hint: scripts to generate human and mouse annotation are inside seqcluster/scripts folder.


  • counts.tsv: count matrix that can be input of downstream analyses
  • size_counts.tsv: size distribution of the small RNA by annotation group
  • seqcluster.json: json file containing all information
  • log/run.log: all messages at debug level
  • log/trace.log: to keep trace of algorithm decisions

Interactive HTML Report

This will create html report using the following command assuming the output of seqcluster cluster is at res:

seqcluster report -j res/seqcluster.json -o report -r $GENONE_FASTA_PATH

where $GENOME_FASTA_PATH is the path to the genome fasta file used in the alignment.

Note: you can try our new visualization tool!

  • report/html/index.html: table with all clusters and the annotation with sorting option
  • report/html/[0-9]/maps.html: summary of the cluster with expression profile, annotation, and all sequences inside
  • report/html/[0-9]/maps.fa: putative precursor

An example of the output is below:

Easy start with

Note:If you already are using bcbio, visit bcbio to run the pipeline there.

To install the small RNA data: upgrade -u development --tools --datatarget smallrna

Options to run in a cluster

It uses ipython-cluster-helper to send jobs to nodes in the cluster

  • –parallel should set to ipython
  • –scheduler should be set to sge,lsf,slurm
  • –num-jobs indicates how much jobs to launch. It will run samples independently. If you have 4 samples, and set this to 4, 4 jobs will be launch to the cluster
  • –queue the queue to use
  • –resources allows to set any special parameter for the cluster, such as, email in sge system:

Read complete usability here: An examples in slurm system is:

--parallel ipython --scheduler slurm --num-jobs 4 --queue general


  • one folder for each analysys, and inside one per sample
  • adapter: *clean.fastq is the file after adapter removal, *clean_trimmed.fastq is the collapse clean.fastq, *fragments.fastq is file without adapter, *short.fastq is file with reads < 16 nt.
  • align: BAM file results from align trimmed.fastq
  • mirbase: file with miRNA anotation and novel miRNA discovery with mirdeep2
  • tRNA: analysis done with tdrmapper [citation needed]
  • qc: *_fastqc.html is the fastqc results from the uncollapse fastq file
  • seqcluster: is the result of running seqcluster. See its documentation for further information.
  • report/srna-report.Rmd: template to create a quick html report with exploration and differential expression analysis. See example here