Introduction to Nanopore Sequencing
In this tutorial we will assemble the E. coli genome using a mix of long, error-prone reads from the MinION (Oxford Nanopore) and short reads from a HiSeq instrument (Illumina).
The MinION data used in this tutorial come from a test run by the Loman lab.
The Illumina data were simulated using InSilicoSeq.
Get the Data
First, download the nanopore data:
wget http://s3.climb.ac.uk/nanopore/ecoli_allreads.fasta
You will not need the HiSeq data right away, but you can start the download in another window:
curl -O -J -L https://osf.io/pxk7f/download
curl -O -J -L https://osf.io/zax3c/download
Look at the basic statistics of the nanopore reads:
assembly-stats ecoli_allreads.fasta
Question
How many nanopore reads do we have?
Question
How long is the longest read?
Question
What is the average read length?
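The three questions above can also be cross-checked directly against the FASTA file. The following is a minimal sketch using awk; `fasta_stats` is a hypothetical helper name, not part of assembly-stats, and it assumes every record starts with a `>` header line.

```shell
# fasta_stats FILE: print read count, longest read and mean read length.
# Hypothetical helper; handles sequences wrapped over several lines.
fasta_stats() {
    awk '
        function record() { n++; total += len; if (len > max) max = len }
        /^>/ { if (seen) record(); seen = 1; len = 0; next }  # header closes the previous read
             { len += length($0) }                            # sequence line
        END  { if (seen) record()
               printf "reads\t%d\nlongest\t%d\nmean\t%.1f\n", n, max, (n ? total / n : 0) }
    ' "$1"
}

# Only runs if the tutorial file has been downloaded.
if [ -f ecoli_allreads.fasta ]; then
    fasta_stats ecoli_allreads.fasta
fi
```

The numbers should agree with the read count, largest sequence and average length reported by assembly-stats.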
Adapter trimming
The Guppy basecaller, i.e. the program that translates the raw electrical signal into FASTQ files, already demultiplexes and trims the reads for us.
Assembly
We take the first 20,000 lines of the read file as a subset and assemble it using wtdbg2 (version > 2.3):
head -n 20000 ecoli_allreads.fasta > subset.fasta
wtdbg2 -x ont -i subset.fasta -fo assembly
wtpoa-cns -i assembly.ctg.lay.gz -fo assembly.ctg.fa
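Note that `head -n 20000` selects lines, not reads: it yields 10,000 complete reads only if every record occupies exactly two lines (one header, one sequence line), which is an assumption about this particular file. A sketch of a subsetting step that counts records instead, using a hypothetical helper `first_n_reads`:

```shell
# first_n_reads N FILE: print the first N FASTA records, whatever the line wrapping.
# Hypothetical helper, shown as an alternative to head -n.
first_n_reads() {
    awk -v n="$1" '
        /^>/  { if (++count > n) exit }  # stop at the header of record N+1
        count { print }                  # print everything once the first header is seen
    ' "$2"
}

# Only runs if the tutorial file has been downloaded.
if [ -f ecoli_allreads.fasta ]; then
    first_n_reads 10000 ecoli_allreads.fasta > subset.fasta
fi
```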
Polishing
Since the assembly likely contains a lot of errors, we correct it with Illumina reads.
First we map the short reads against the assembly:
bowtie2-build assembly.ctg.fa assembly
bowtie2 -x assembly -1 ecoli_hiseq_R1.fastq.gz -2 ecoli_hiseq_R2.fastq.gz | \
    samtools view -bS -o assembly_short_reads.bam
samtools sort assembly_short_reads.bam -o assembly_short_sorted.bam
samtools index assembly_short_sorted.bam
Then we run the consensus step:
samtools view assembly_short_sorted.bam | wtpoa-cns -t 16 -x sam-sr \
    -d assembly.ctg.fa -i - -fo assembly_polished.fasta
This corrects any mismatches in our assembly and writes the improved assembly to assembly_polished.fasta.
For better results we should perform more than one round of polishing.
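One way to run several rounds is to feed each polished assembly back in as the reference for the next round. The sketch below reuses the tools and read file names from this section. The function names `polish_round` and `polish_rounds` and the `assembly_polished_roundN.fasta` output names are hypothetical, and piping bowtie2 straight into `samtools sort` is a shortcut that assumes a samtools version able to sort from standard input.

```shell
# polish_round IN.fa OUT.fasta: map the short reads to IN.fa and write a polished OUT.fasta.
polish_round() {
    in_fa=$1
    out_fa=$2
    prefix=${out_fa%.fasta}                      # reused for the index and the BAM file
    bowtie2-build "$in_fa" "$prefix" > /dev/null
    bowtie2 -x "$prefix" -1 ecoli_hiseq_R1.fastq.gz -2 ecoli_hiseq_R2.fastq.gz |
        samtools sort -o "$prefix.bam" -
    samtools view "$prefix.bam" |
        wtpoa-cns -t 16 -x sam-sr -d "$in_fa" -i - -fo "$out_fa"
}

# polish_rounds ROUNDS: chain ROUNDS polishing rounds, print the final file name.
polish_rounds() {
    current=assembly.ctg.fa
    i=1
    while [ "$i" -le "$1" ]; do
        polish_round "$current" "assembly_polished_round$i.fasta" || return 1
        current="assembly_polished_round$i.fasta"
        i=$((i + 1))
    done
    echo "$current"
}

# Only runs once the draft assembly exists and wtpoa-cns is installed.
if [ -f assembly.ctg.fa ] && command -v wtpoa-cns > /dev/null; then
    polish_rounds 2
fi
```

Each round maps the reads against the output of the previous one, so the run time grows roughly linearly with the number of rounds.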
Compare with the existing assembly and an Illumina-only assembly
An existing assembly
Go to https://www.ncbi.nlm.nih.gov and search for NC_000913.
Download the associated genome in FASTA format and rename it to ecoli_ref.fasta, then align the polished assembly against it and plot the alignment:
nucmer --maxmatch -c 100 -p ecoli assembly_polished.fasta ecoli_ref.fasta
mummerplot --fat --filter --png --large -p ecoli ecoli.delta
Then take a look at ecoli.png.
Compare metrics
Note
First you need to assemble the Illumina data.
Then run BUSCO and QUAST on the three assemblies.
Question
Which assembly would you say is the best?
Annotation
If you have time, train your annotation skills by running prokka on your genome!
prokka --outdir annotation --kingdom Bacteria assembly_polished.fasta
You can open the output to see how it went
cat annotation/*.txt
Question
Does it fit your expectations? How many genes were you expecting?
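To put a number on the question above, the features can also be counted in the GFF3 file that prokka writes to the output directory. A minimal sketch, assuming one CDS row per predicted protein-coding gene; `count_features` is a hypothetical helper name.

```shell
# count_features TYPE GFF: count GFF3 rows whose third column equals TYPE.
# Comment lines and the trailing FASTA section contain no tabs, so they never match.
count_features() {
    awk -F '\t' -v type="$1" '$3 == type { n++ } END { print n + 0 }' "$2"
}

# Only reports files that actually exist in the prokka output directory.
for gff in annotation/*.gff; do
    if [ -f "$gff" ]; then
        echo "$gff: $(count_features CDS "$gff") CDS"
    fi
done
```

The same helper works for other feature types, for example `count_features tRNA annotation/*.gff` when the directory holds a single GFF file.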