bioinfo@ird.fr

Trainings 2019 – RNAseq – Practice


layout: page
title: "RNASeq Practice"
permalink: /linux/rnaseqPractice/
tags: [ rnaseq, survival guide ]
description: RNASeq Practice page

Description Hands On Lab Exercises for RNASeq
Related-course materials Transcriptomique
Authors Julie Orjuela (julie.orjuela@irf.fr), Gautier Sarah (gautier.sarah@cirad.fr), Catherine Breton (c.breton@cgiar.org), Aurore Comte (aurore.compte@ird.fr), Alexis Dereeper (alexis.dereeper@ird.fr), Sebastien Ravel (sebastien.ravel@cirad.fr), Sebastien Cunnac (sebastien.cunnac@ird.fr)
Creation Date 15/03/2018
Last Modified Date 17/05/2019

Summary


Practice 1 : Mapping against transcriptome reference + counting with Kallisto

Practice1 will be performed in the Galaxy environment.

We will perform a transcriptome-based mapping and estimates of transcript levels using Kallisto, and a differential analysis using EdgeR.

  • Connect to Galaxy IRD
  • Create a new history and import all the 8 RNASeq samples datasets (paired-end fastq files) from Data library
    Shared Data => Data Libraries => Galaxy_trainings_2019 => RNASeq
  • Upload the Chr1 of rice transcriptome (cDNA) to be used as reference - http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_7.0/chr01.dir/Chr1.cdna
  • Run the kallisto quant program by providing Chr1 as transcriptome reference and specifying correctly pairs of input fastq- kallisto quant
    You can do it with the pairs made one by one manually or you can make lists of dataset pairs. If you choose this second option:
  • Build one list with the pairs of condition 1 and on other list with the pairs of condition 2.
  • launch kallisto on each of the two lists => you get 2 kallisto outputs collections
  • Convert kallisto outputs (collection of count files) into one single file that can be used as input for EdgeR -Kallisto2EdgeR

Practice 2 : Mapping against annotated genome reference with Hisat2 + counting with Stringtie

Running Hisat2 and Stringtie with TOGGLe

Connection to account in IRD i-Trop cluster ssh formationX@bioinfo-master.ird.fr

Input data are accessible from :

  • Input data : /data2/formation/tp-toggle/RNASeqData/
  • Reference : /data2/formation/tp-toggle/RNASeqData/referenceFiles/chr1.fasta
  • Annotation : /data2/formation/tp-toggle/RNASeqData/referenceFiles/chr1.gff3
  • Config file: RNASeqReadCount.config.txt

Import data from nas to your $HOME

  • Create a toggleTP directory in your $HOME mkdir ~/TP-RNASEQ
  • Make à copy for reference and input data into TP-RNASEQ directory: cp -r /data2/formation/tp-toggle/RNASeqData/ ~/TP-RNASEQ
  • Add the configuration file used by TOGGLe cd ~/TP-RNASEQ/; wget https://raw.githubusercontent.com/SouthGreenPlatform/TOGGLE/master/exampleConfigs/RNASeqHisat2Stringtie.config.txt
  • change SGE key $sge as below using a texte editor like nano nano RNASeqHisat2Stringtie.config.txt
$sge
-q formation.q
-b Y
-cwd
  • in TOGGLe configuration file use /scratch in $scp key to launch your job from scratch folder and also $env key using
    module load bioinfo/TOGGLE-dev/0.3.7 module installed on cluster.
  • Check parameters of every step in ~/toggleTP/RNASeqData/RNASeqHisat2Stringtie.config.txt as recommended by https://www.nature.com/articles/nprot.2016.095.

Mapping is performed using HISAT2 and usually the first step, prior to mapping, is to create an index of the reference genome. TOGGle index genome automatically if indexes are absents in reference folder.
In order to save some space think to only keep sorted BAM files one these are created.
After mapping, assemble the mapped reads into transcripts. StringTie can assemble transcripts with or without annotation, annotation can be helpful when the number of reads for a transcript is too low for an accurate assembly.

$order
1=fastqc
2=hisat2
3=samtoolsView
4=samtoolsSort
5=stringtie 1
1000=stringtie 2

$samtoolsview
-b
-h

$samtoolssort

$cleaner
3

#PUT YOUR OWN SGE CONFIGURATION HERE
$sge
-q formation.q
-b Y

$stringtie 1

$stringtie 2
--merge

$hisat2
--dta

$scp
/scratch/

$env
module load bioinfo/TOGGLE-dev/0.3.7

... Your data are now in ~/TP-RNASEQ.

Now, create a runTOGGLeRNASEQ.sh bash script to launch TOGGLe as follow :

#!/bin/bash
#$ -N TOGGLeRNAseq
#$ -b yes
#$ -q formation.q
#$ -cwd
#$ -V

dir="~/TP-RNASEQ/RNASeqData/fastq"
out="~/TP-RNASEQ/outTOGGLe"
config="~/TP-RNASEQ/RNASeqHisat2Stringtie.config.txt"
ref="~/TP-RNASEQ//RNASeqData/referenceFiles/chr1.fasta"
gff="~/TP-RNASEQ//RNASeqData/referenceFiles/chr1.gff3"
## Software-specific settings exported to user environment
module load bioinfo/TOGGLE-dev/0.3.7

#running tooglegenerator 
toggleGenerator.pl -d $dir -c $config -o $out -r $ref -g $gff --report --nocheck;

echo "FIN de TOGGLe ^^"

Convert runTOGGLeRNASEQ in an executable file with chmod +x runTOGGLeRNASEQ.sh

Launch runTOGGLeRNASEQ.sh in qsub mode

qsub -q formation.q -N TOGGLeRNASEQ -b yes -cwd 'module load bioinfo/TOGGLE-dev/0.3.7; ./runTOGGLeRNASEQ.sh '

Explore output outTOGGLe TOGGLe and check if everything was ok.

Open and exploreoutTOGGLe/finalResults/intermediateResults.STRINGTIEMERGE.gtf

Now that we have our assembled transcripts, we can estimate their abundances.

To estimate abundances, we have to run again stringtie using options -B and -e.

prepare data
  • Create symbolics links to order data before transferring them to /scratch
MOI="formationX"
OUTPUT="/home/$MOI/TP-RNASEQ/outTOGGLe/"
mkdir $OUTPUT/stringtieEB
cd $OUTPUT/stringtieEB 
ln -s $OUTPUT/finalResults/intermediateResults.STRINGTIEMERGE.gtf .
ln -s $OUTPUT/output/*/4_samToolsSort/*SAMTOOLSSORT.bam .
transfert to /scratch
  • Remember good practices to work at IRD Cluster. You have to copy data into a path /scratch in a node. What is your node number?
qrsh -q formation.q
MOI="formationX"
OUTPUT="/home/$MOI/TP-RNASEQ/outTOGGLe/"
scp -r nas:$OUTPUT/stringtieEB /scratch/$MOI 
cd /scratch/$MOI
Recovery annotations
  • Before merging gtf files obtained by stringtie, we have to recover the annotations in order to see the known genes names in gtf file. Stringtie annotate transcripts using gene id 'MSTRG.1' nomenclature . See https://github.com/gpertea/stringtie/issues/179
module load system/python/3.6.5
python3 /data2/formation/TP_RNA-seq_2019/gpertea-scripts/mstrg_prep.py intermediateResults.STRINGTIEMERGE.gtf > intermediateResults.STRINGTIEMERGE_prep.gtf
  • Compare gtf files before and after running mstrg_prep.py script. To do it you can choose a gene and explore differences:
grep 'LOC_Os01g01010.1' intermediateResults.STRINGTIEMERGE*
gffcompare
  • Let’s compare the StringTie transcripts to known transcripts using gffcompare https://github.com/gpertea/gffcompare and explore results. Observe statistics. How many "J", "U" and "=" do you obtain?. gffcompare_out.annotated.gtf file will be visualised with IGV later.
scp ~/TP-RNASEQ//RNASeqData/referenceFiles/chr1.fasta .
/data2/formation/TP_RNA-seq_2019/gffcompare/gffcompare -r chr1.gff3 -o gffcompare_out  intermediateResults.STRINGTIEMERGE_prep.gtf
Stringtie -e -B
  • ... Finally, we launch stringtie: (change nodeXX by your node number)
for i in *bam ; do eval "mkdir ${i/.SAMTOOLSSORT.bam/}; qsub -q formation.q@nodeXX -N stringtie2 -cwd -V -b yes 'module load bioinfo/stringtie/1.3.4; stringtie" $PWD"/"$i "-G $PWD"/"intermediateResults.STRINGTIEMERGE_prep.gtf -e -B -o $PWD/${i/.SAMTOOLSSORT.bam/}/${i/bam/count}'"; done
Convert to counts table
  • Convert stringtie output in counts using prepDE.py. Dont forget! You are in /scratch /scratch/formationX
mkdir counts
cd counts/
ln -s /scratch/$MOI/*/*.count .
for i in *.count; do echo ${i/.SAMTOOLSSORT.count/} $PWD/$i; done > listGTF.txt
python2 /data2/formation/TP_RNA-seq_2019/prepDE.py -i listGTF.txt

You have obtained gene_count_matrix.csv and transcript_count_matrix.csv

Transfer data from /scratch to your home on cluster
  • Don't forget scp *.counts files to your $OUTPUT
scp -r /scratch/$MOI/counts/ ~/TP-RNASEQ/
scp -r /scratch/$MOI/gffcompare*/ ~/TP-RNASEQ/
Transfer data to local machine
  • From your local terminal, transfer counts to your local machine with scp
scp -r formationX@bioinfo-nas.ird.fr:/home/formationX/TP-RNASEQ/counts/ .
  • Transfer also reference files fasta, gff and gffcompare_out.annotated.gtf to use it later with IGV.
scp -r formationX@bioinfo-nas.ird.fr:/home/formationX/toggleTP/RNASeqData/referenceFiles/*.gff3 .
scp -r formationX@bioinfo-nas.ird.fr:/home/formationX/toggleTP/RNASeqData/referenceFiles/*.fasta .
scp -r formationX@bioinfo-nas.ird.fr:/home/formationX/TP-RNASEQ/gffcompare* .

Practice 3 : Differential expression analysis using EdgeR and DESeq2

Practice 3 will be performed in PIVOT via R Studio.

PIVOT: Platform for Interactive analysis and Visualization Of Transcriptomics data
Qin Zhu, Junhyong Kim Lab, University of Pennsylvania
Oct 26th, 2017

Intallation of PIVOT

Dependencies that needs to be manually installed.
You may need to paste the following code line by line
and choose if previously installed packages should be updated (recommended).

install.packages("devtools") 
library("devtools")
install.packages("BiocManager")
BiocManager::install("BiocUpgrade") 
#BiocManager::install("GO.db")
BiocManager::install("HSMMSingleCell")
#BiocManager::install("org.Mm.eg.db")
#BiocManager::install("org.Hs.eg.db")
BiocManager::install("DESeq2")
BiocManager::install("SingleCellExperiment")
BiocManager::install("scater")
BiocManager::install("monocle")
BiocManager::install("GenomeInfoDb")

Install PIVOT

install_github("qinzhu/PIVOT")
BiocManager::install("BiocGenerics") # You need the latest BiocGenerics >=0.23.3

Launch PIVOT

To run PIVOT, in Rstudio console related to a web shiny interface, use command

library(PIVOT)
pivot()

Go to your web browser.


Introduction

As you have seen, RNA sequencing (RNA-seq) is an experimental method for measuring RNA expression levels via high-throughput sequencing of small adapter-ligated fragments.
The number of reads that map to a gene is the proxy for its expression level.
There are many steps between receiving the raw reads from the sequencer and gaining biological insight.
In this tutorial, we will start with a "Table of counts" and end with a "List of differentially expressed genes".

Step 1 : Data Input

To input expression matrix, select “Counts Table” as input file type. PIVOT expects the count matrix to have rows as genes and samples as columns.
Gene names and sample names should be the first column and the first row, respectively.

  • Go to file Tab.
  • Take the count file gene_count_matrix.csv generated previously.
  • Import this file into Data input and then Input file type.
  • Add Use Tab separator to Skip Rows.
  • Check if yours data are imported in the rigth window.

Step 2 : Input Design Information

The design infomation are used for sample point coloring and differential expression analysis. Users can input the entire sample meta sheet as
design information, or manually specify groups or batches for each sample.

  • Go to DESIGN.
  • Go to Designed Table Upload. Upload info.txt
  • Verify that the header of the info file corresponds to the count file.
  • Choose the Separator : Space or the appropriate separator.
  • Verify on the Design Table Preview and submit design.
  • Choose the Normalieation Method :

    • for Edge R you can use DESeq, Trimmed Mean of M-values TMM, or Upperquartile.
    • for DESeq you can use DESeq, Modifed DESeq

    In order to have a quick view of your chosen data, look at the summary.

step 3 : Feature Filtering

  • There are currently 3 types of feature filter in PIVOT: the expression filter, which filters based on various expression statistics;
  • You can choose the filter criteria.

Step 4 : Select samples

  • Go To SAMPLE
  • Select your sample and condition,
    • Subset by sample and or subset by list, as you can create different experiment analysis.

DATA MAP draw a summary of your different analysis, so you can save the history of your analysis.

Step 5 : Basic statistics

  • If you want to keep the count table, upload it.
  • Check the distribution of each condition in the standard deviation graph, the dispersion graph.
  • If needed, you can download the Variably Expressed Genes, and on the graph, you can see the dispersion of your data.

First part with EdgeR on PIVOT

  • Run the EdgeR program for differential analysis - EdgeR
  • Check relevance of normalized expression values provided by EdgeR
  • Observe MDS plot of experimental conditions. Observe Smear plot.

Questions :

  • Using filters parameters, determine how many genes are found to be differentally expressed using a minimum pvalue <= 0.05, 0.1? Using a minimum FDR-adjusted pvalue <= 0.05, 0.1?

Step 6 : Differential Expression with EdgeR

Once data have been normalized in the Step 1, you can choose the method to find the Differential expression gene between the condition previously choosen.
In edgeR, it is recommended to remove genes with very low counts. Remove genes (rows) which have zero counts for all samples from the dge data object.

  • Go to Differential Expression, choose edgeR
  • According to your normalized method choice in step 1, notify the Data Normalization method, and choose Experiment Design, and condition.
  • For this dataset choose condition, condition
  • Choose the Test Method, Exact test, GLM likelihood ratio test, and GLM quasi-likelihood F-test.
  • For this dataset choose Exact test

Then Run Modeling

  • Choose the P adjustement adapted
  • For this dataset choose False discovery rate or Bonferroni correction
  • Choose FDR cutoff 0.1 and 0.05
  • Choose Term 1 and Term 2

You obtain a table containing the list of the differential gene expressed according to your designed analysis. This table contains logFoldChange, logCountPerMillion, PValue, FDR.
This table can be download in order to use it for other analysis. The Mean-Difference Plot show you the up-regulated gene in green, and Down regulated gene in black.

  • Keep this file edgeR_results.csv
  • Change the name of the file according to your analysis.
  • Example : edgeR_results_FDR_01.csv to be able to recognize the different criteria used.

    Questions :

    • Do you have the sample number of differential gene for a FDR cutoff 0.1 and 0.05?
    • According to you, what is the best cutoff?
    • When you change the order of the Term1 and Term2, what are the consequences on the Results Table?

Second part with DESeq2 on PIVOT

  • Run the DESeq program for differential analysis - DESeq2
  • Check relevance of normalized expression values provided by DESeq2
  • Observe MDS plot of experimental conditions. Observe Smear plot.

Questions:

  • Using filters parameters, determine how many genes are found to be differentially expressed using a minimum pvalue <= 0.05, 0.1? Using a minimum FDR-adjusted pvalue <= 0.05, 0.1?

Step 6 : Differential Expression with DESeq2

Once data have been normalized in the Step 1, you can choose the method to find the Differential expression gene between the condition previously choosen.

  • Go to Differential Expression, choose DEseq2
  • According to your normalized method choice in step 1, notify the Data Normalization method, and choose Experiment Design, and condition.
  • For this dataset choose condition, condition
  • Choose the Test Method, Exact test, GLM likelihood ratio test, and GLM quasi-likelihood F-test.
  • For this dataset choose Exact test

Then Run Modeling

  • Choose the P adjustement adapted
  • For this dataset choose FDR cutoff
  • Choose FDR cutoff 0.1 and 0.05

You obtain a table containing the list of the differential gene expressed according to your designed analysis. This table contains baseMean, log2FoldChange, PValue, pvalueadjust.
This table can be downloaded in order to use it for other analysis. The MA Plot show you the up-regulated gene LFC>0, and Down regulated gene LFC<0, the differential gene are in red.

  • Keep this file deseq_results.csv
  • Change the name of the file according to your analysis.
  • Example : deseq_results_FDR_01.csv to be able to recognize the different criteria used.

    Questions :

    • Do you have the sample number of defferential gene for a FDR cutoff 0.1 and 0.05?
    • According to you, what is the best cutoff?
    • When you change the order of the Term1 and Term2, what are the consequence on the Results Table?

Step 7 : Visualisation Heatmap, Correlation, PCA

You can make correlation between the control and the rep in order to identify library bias if exists.
We can also explore the relationships between the samples by visualizing a heatmap of the correlation matrix.
The heatmap result corresponds to what we know about the data set.
First, the samples in group 1 and 2 come from the control and the repetition, so the two groups are very different.
Second, since the samples in each group are technical replicates, the within group variance is very low.

Correlation between sample and control

  • Go to Correlation.
  • 1 Pairwaise Scatterplot show the pairwaise comparison between your samples.
  • 2 Sample correlation with 3 methods, pearson, sperman or kendal, with the agglomeration method show you how are linked your samples.
  • 3 Feature Correlation represented with a heatMap ordered by variance of the expression.

PCA

  • To separate your sample a PCA is way to make a dimension reduction.

Question :

  • Can you describe the structure of your sample, with a correlation analysis, and a PCA?


Practice 4 : Compare list of DE genes with EdgeR and DESeq2

Practice 4 will be performed with Venn Diagramm implemented on PIVOT.

  • Compare lists of DE genes with the two approches.
    • Go to Toolkit.
    • Upload your lists of gene obtained with edegR edgeR_results_FDR_01.csv , and DESeq2 deseq_results_FDR_01.csv.
    • Upload the first list then Add List and upload the second list.
    • You can see the Venn diagram and download the common list of gene between the both methods.

Questions:

  • Look at the expression values for a gene found DE with EdgeR and not with DESeq2, and vice-versa, give the pvalue of each gene?

Some other tools are available to compare 2 lists of gene. Venny.


Practice 5 : Hierarchical Clustering

Practice 5 will be performed with PIVOT.

  • Connect to your PIVOT interface.
  • Go to Clustering.
    • For each analysis EdgeR or DESeq2 specify the Data Input (count, log...).
    • Choose the distance Euclidean or an other, the Agglomeration method Wardand the number of cluster.

Practice 6 : Visualization of mapped reads against genes using IGV

Practice 6 will be performed with Integrated Genome Viewer (IGV).

  • Focus on a gene that has been found to be differentially expressed and observe the structure of the gene.

  • From master0 qlogin -q formation.q

  • Lauch samtools index using bam obtained by hisat2:

for fl in ./*.bam; do samtools index $fl; done
  • Run igv : igv.sh &

  • Load reference genome, GFF annotation file, BAMs files and the gffCompare gffcompare_out.annotated.gtf output.

  • Recovery some ID to visualise it in IGV:

grep 'class_code "u"' gffcompare_out.annotated.gtf | less
  • Copy a identifiant, for example, "XLOC_000469" et and search it in IGV. Show this loci.

  • or :

grep 'class_code "x"' gffcompare_out.annotated.gtf | less
  • Search other gene, for example, "LOC_Os01g01710".

Links


License

The resource material is licensed under the Creative Commons Attribution 4.0 International License (here).