
Mapping

Contributors

Questions

Objectives

Last modification: Feb 25, 2022

Example NGS pipeline

High level view of a typical NGS workflow

A high level view of a typical NGS bioinformatics workflow

Speaker Notes


What is mapping?

.pull-left[ Mapping vs assembly ]

.pull-right[

Speaker Notes


class: top

Sequence alignment

Speaker Notes


class: top

Sequence alignment

Speaker Notes But if we introduce gaps and allow some mismatched bases, the sequences match up pretty well.

Speaker Notes

Some reads may map to multiple locations

We need a way to determine the best alignment when none of the candidates is a perfect match.


class: top

Alignment Scoring (basics)

.center[ .image-25[Screenshot of a sequence scoring game where two sequences are being aligned across the top (GGCTGG and GAGG) and the per-base and cumulative scores from left to right.]

Example (with affine gap penalty) ]
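As a rough sketch of how such a score can be computed, the Python below scores an already-gapped alignment column by column with an affine gap penalty. The function name `score_alignment` and the scoring values (match +1, mismatch -1, gap open -2, gap extend -1) are illustrative assumptions, not the values used in the screenshot or by any particular aligner. The two sequences are the ones from the game (GGCTGG and GAGG); the placement of the gaps is made up for the example.

```python
def score_alignment(ref, obs, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score two equal-length gapped strings with an affine gap penalty.

    The first column of a gap run costs gap_open; every further column of
    the same run costs gap_extend. All values are illustrative assumptions.
    """
    if len(ref) != len(obs):
        raise ValueError("aligned strings must have the same length")
    total = 0
    prev_gap = None  # row that held a gap in the previous column, if any
    for r, o in zip(ref, obs):
        if r == "-" and o == "-":
            raise ValueError("a column cannot contain two gaps")
        if r == "-" or o == "-":
            gap_row = "ref" if r == "-" else "obs"
            total += gap_extend if prev_gap == gap_row else gap_open
            prev_gap = gap_row
        else:
            total += match if r == o else mismatch
            prev_gap = None
    return total


# One long gap: opened once, extended once.
print(score_alignment("GGCTGG", "GA--GG"))  # -1
# Two separate gaps: each one pays the opening cost.
print(score_alignment("GGCTGG", "G-A-GG"))  # -2
```

With an affine penalty, one gap of length two is cheaper than two gaps of length one, which is why the first placement scores higher here.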

Speaker Notes


class: top

Alignment Scoring (advanced)

.center[ .image-50[ Transitions vs transversions ] .image-25[ Example scoring matrix ] ]

.footnote[More information about mapping algorithms: 10.1089/cmb.2012.0022]

Speaker Notes Many more complexities can be considered, and different tools make different choices.

Transitions are more likely to occur in real sequences, so an aligner may give them a lower penalty than transversions.

Transitions are interchanges of two-ring purines (A G) or of one-ring pyrimidines (C T): they therefore involve bases of similar shape.

Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures.

Transitions and transversions
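The distinction above can be sketched in a few lines of Python. The function names and the penalty values (-1 for a transition, -2 for a transversion) are assumptions chosen for illustration; real tools use their own scoring matrices.

```python
PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases


def substitution_type(ref_base, alt_base):
    """Classify a pair of DNA bases as match, transition or transversion."""
    ref_base, alt_base = ref_base.upper(), alt_base.upper()
    for base in (ref_base, alt_base):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA base: {base!r}")
    if ref_base == alt_base:
        return "match"
    pair = {ref_base, alt_base}
    # Both bases in the same chemical class means a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def mismatch_penalty(ref_base, alt_base, transition=-1, transversion=-2):
    """Hypothetical penalty that is milder for transitions."""
    kind = substitution_type(ref_base, alt_base)
    return {"match": 0, "transition": transition, "transversion": transversion}[kind]


print(substitution_type("A", "G"))  # transition
print(substitution_type("C", "T"))  # transition
print(substitution_type("A", "C"))  # transversion
print(mismatch_penalty("G", "T"))   # -2
```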


Looks easy but..


class: top

Sequence Alignment

Reference: AAA CAGTGA GAA
Observed:  AAA TCTCT  GAA

Speaker Notes Suppose we want to map this read (bottom) to this reference sequence (top)


class: top

Sequence Alignment

Reference: AAA CAGTGA GAA
Observed:  AAA TCTCT  GAA
Alignment
AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA
Maybe like this?

Speaker Notes This is one possibility. Is it the only one?


class: top

Sequence Alignment

Reference: AAA CAGTGA GAA
Observed:  AAA TCTCT  GAA
Alignment
AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA
Maybe like this?
AAACAGTGAGAA
|||-::|::|||
AAA-TCTCTGAA
Or this?

Speaker Notes This is also a possible alignment. Not easy to say which is better.


class: top

Sequence Alignment

Reference: AAA CAGTGA GAA
Observed:  AAA TCTCT  GAA
Alignment
AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA
Maybe like this?
AAACAGTGAGAA
|||-::|::|||
AAA-TCTCTGAA
Or this?
AAACAGTGAGAA
|||:-:|::|||
AAAT-CTCTGAA

Or..?

Speaker Notes And a third option


class: top

Sequence Alignment

Reference: AAA CAGTGA GAA
Observed:  AAA TCTCT  GAA
Alignment
AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA
Maybe like this?
AAACAGTGAGAA
|||-::|::|||
AAA-TCTCTGAA
Or this?
AAACAGTGAGAA
|||:-:|::|||
AAAT-CTCTGAA

Or..?
AAACAGTGA-----GAA
|||-----------|||
AAA------TCTCTGAA
What about this?

Speaker Notes There is no one right way to do alignment

Mapping is a non-trivial problem!
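To make this concrete, the sketch below scores the four candidate alignments of this example under three scoring schemes. The scheme names and their weights are invented for illustration and are not taken from any aligner; the point is only that the preferred alignment changes with the parameters.

```python
# (reference row, observed row) for the four candidate alignments.
ALIGNMENTS = {
    "first": ("AAA-CAGTGAGAA", "AAATC--TCTGAA"),
    "second": ("AAACAGTGAGAA", "AAA-TCTCTGAA"),
    "third": ("AAACAGTGAGAA", "AAAT-CTCTGAA"),
    "fourth": ("AAACAGTGA-----GAA", "AAA------TCTCTGAA"),
}

# Hypothetical weights: (match, mismatch, gap open, gap extend).
SCHEMES = {
    "balanced": (1, -1, -2, -1),
    "mismatch-averse": (1, -4, -3, 0),
    "gap-averse": (1, -1, -5, -1),
}


def count_events(ref, obs):
    """Count matches, mismatches, gap openings and gap extensions."""
    matches = mismatches = opens = extends = 0
    prev_gap = None  # row that held a gap in the previous column
    for r, o in zip(ref, obs):
        if r == "-" or o == "-":
            row = "ref" if r == "-" else "obs"
            if row == prev_gap:
                extends += 1
            else:
                opens += 1
            prev_gap = row
        else:
            if r == o:
                matches += 1
            else:
                mismatches += 1
            prev_gap = None
    return matches, mismatches, opens, extends


def score(ref, obs, weights):
    """Weighted sum of the event counts."""
    return sum(n * w for n, w in zip(count_events(ref, obs), weights))


for scheme, weights in SCHEMES.items():
    scores = {name: score(ref, obs, weights) for name, (ref, obs) in ALIGNMENTS.items()}
    print(scheme, scores)
# balanced {'first': 1, 'second': 1, 'third': 1, 'fourth': -7}
# mismatch-averse {'first': -6, 'second': -12, 'third': -12, 'fourth': 0}
# gap-averse {'first': -5, 'second': -2, 'third': -2, 'fourth': -13}
```

Under the balanced weights three alignments tie, the mismatch-averse weights prefer the fourth, and the gap-averse weights prefer the second and third.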


class: top

Sequence Alignment

Reference: AAA CAGTGA GAA
Observed:  AAA TCTCT  GAA
Alignment | Tool
AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA
Novoalign
AAACAGTGAGAA
|||-::|::|||
AAA-TCTCTGAA
Ssaha2
AAACAGTGAGAA
|||:-:|::|||
AAAT-CTCTGAA

BWA
AAACAGTGA-----GAA
|||-----------|||
AAA------TCTCTGAA
Complete Genomics

Speaker Notes We didn’t just make these up; real aligners produced these different results.


class: top

Sequence Alignment

Reference: AAA CAGTGA GAA
Observed:  AAA TCTCT  GAA
Alignment | Variant calls
AAA-CAGTGAGAA
|||-|--|::|||
AAATC--TCTGAA
ins T
del AG
sub GA -> CT
AAACAGTGAGAA
|||-::|::|||
AAA-TCTCTGAA
del C
sub AG -> TC
sub GA -> CT
AAACAGTGAGAA
|||:-:|::|||
AAAT-CTCTGAA

snp C -> T
del A
snp G -> C
sub GA -> CT
AAACAGTGA-----GAA
|||-----------|||
AAA------TCTCTGAA
del CAGTGA
ins TCTCT

Speaker Notes Important: Mapping can affect downstream analysis!

These different mappings lead to different variant calls, and it is hard to tell that they are equivalent.
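A minimal sketch of how calls like these can be read off an alignment: walk the columns, group consecutive columns of the same kind, and report every group that is not a match. The function names and the grouping rule are assumptions made for illustration (the labels `ins`, `del`, `snp` and `sub` mirror the slide); this is not how any specific variant caller works.

```python
from itertools import groupby


def column_type(ref_base, obs_base):
    """Classify one alignment column."""
    if ref_base == "-":
        return "ins"
    if obs_base == "-":
        return "del"
    return "match" if ref_base == obs_base else "sub"


def call_variants(ref, obs):
    """List the differences implied by a gapped alignment, left to right."""
    calls = []
    for kind, group in groupby(zip(ref, obs), key=lambda col: column_type(*col)):
        columns = list(group)
        ref_bases = "".join(r for r, _ in columns)
        obs_bases = "".join(o for _, o in columns)
        if kind == "ins":
            calls.append(f"ins {obs_bases}")
        elif kind == "del":
            calls.append(f"del {ref_bases}")
        elif kind == "sub":
            label = "snp" if len(columns) == 1 else "sub"
            calls.append(f"{label} {ref_bases} -> {obs_bases}")
    return calls


alignments = [
    ("AAA-CAGTGAGAA", "AAATC--TCTGAA"),
    ("AAACAGTGAGAA", "AAA-TCTCTGAA"),
    ("AAACAGTGAGAA", "AAAT-CTCTGAA"),
    ("AAACAGTGA-----GAA", "AAA------TCTCTGAA"),
]
for ref, obs in alignments:
    print(call_variants(ref, obs))
# ['ins T', 'del AG', 'sub GA -> CT']
# ['del C', 'sub AG -> TC', 'sub GA -> CT']
# ['snp C -> T', 'del A', 'snp G -> C', 'sub GA -> CT']
# ['del CAGTGA', 'ins TCTCT']

# Removing the gaps shows that all four describe the same observed read.
print({obs.replace("-", "") for _, obs in alignments})  # {'AAATCTCTGAA'}
```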


Try it yourself!

.image-75[Recording of alignment game]

.footnote[https://tinyurl.com/sequence-alignment]

Speaker Notes Learners can play around with this alignment game now.

Alternatively, use Lego bricks, with a different colour for each nucleotide.


Paired-end sequencing

Speaker Notes


class: top

Repeats

Speaker Notes In the case of repeats, a single-end read alone would not have been enough for unique mapping.

Speaker Notes But with the additional information provided by the paired-end protocol (the distance to the mate), this can now be resolved.
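A toy illustration of that idea: a read from a repeat has two equally good candidate positions, and the uniquely mapped mate decides between them. The function name, the positions, and the expected distance of 300 bases with a tolerance of 50 are all invented for this sketch; real mappers also check strand and orientation, which is ignored here.

```python
def compatible_positions(candidates, mate_position, expected_distance=300, tolerance=50):
    """Return the candidate positions whose distance to the mate is plausible."""
    return [
        pos
        for pos in candidates
        if abs(abs(mate_position - pos) - expected_distance) <= tolerance
    ]


# The read matches two copies of a repeat; its mate maps uniquely.
repeat_hits = [1000, 5000]
mate_hit = 5290
print(compatible_positions(repeat_hits, mate_hit))  # [5000]
```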


class: top

InDels (Insertions / Deletions)

Speaker Notes

FAQ: “What about mate-pair sequencing?”


class: top

Paired-end FASTQ files

Speaker Notes When you have paired-end data, you will usually get 2 files.

The pairing is also visible in the read names.

Speaker Notes Sometimes the data come in a single interleaved (also called interlaced) file.


class: top

Paired-end FASTQ files

Speaker Notes Most tools blindly assume that the first read in the forward file is paired with the first read in the reverse file, the second with the second, and so on.

Checking the read names of every pair would be too slow.

When trimming and filtering, if a read is removed from one file, its mate must be removed from the other file too!

Always trim together in paired-end mode!
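The rule above can be sketched as a filter that looks at both mates before deciding. The function names and the minimum length of 30 are assumptions for illustration; the read names and sequences are the three pairs used in these slides, with quality strings left out for brevity.

```python
def filter_pairs(forward, reverse, passes):
    """Keep a pair only if BOTH mates pass; otherwise drop both mates."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse must hold the same number of reads")
    kept_forward, kept_reverse = [], []
    for fwd, rev in zip(forward, reverse):
        if passes(fwd) and passes(rev):
            kept_forward.append(fwd)
            kept_reverse.append(rev)
    return kept_forward, kept_reverse


def long_enough(record, min_length=30):
    """Hypothetical quality filter: keep reads of at least min_length bases."""
    _, sequence = record
    return len(sequence) >= min_length


# (name, sequence) records in file order.
forward = [
    ("PAIR-1", "GGGTGATGGCCGCTGCCGATGGCGTCAAAT"),
    ("PAIR-2", "GATTTGGGGTTCAAAGCAGTATCGATCAA"),  # only 29 bases
    ("PAIR-3", "TCGCACTCAACGCCCTGCATATGACAAGAC"),
]
reverse = [
    ("PAIR-1", "AAGTTACCCTTAACAACTTAAGGGTTTTCA"),
    ("PAIR-2", "AGCAGAAGTCGATGATAATACGCGTCGTTT"),
    ("PAIR-3", "AATCCATTTGTTCAACTCACAGTTTACCGT"),
]

kept_forward, kept_reverse = filter_pairs(forward, reverse, long_enough)
print([name for name, _ in kept_forward])  # ['PAIR-1', 'PAIR-3']
print([name for name, _ in kept_reverse])  # ['PAIR-1', 'PAIR-3']
```

The reverse read of PAIR-2 passes the filter on its own, but it is removed anyway because its mate failed, so both outputs stay in the same order.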

.pull-left[ .red[

@PAIR-1 forward
GGGTGATGGCCGCTGCCGATGGCGTCAAAT
+
))%255CCF>>>>>>CCCCCCC65`IIII%

] .orange[

@PAIR-2 forward
GATTTGGGGTTCAAAGCAGTATCGATCAA
+
!''3((((^^d+))%%%++)(%%%%).1)

] .blue[

@PAIR-3 forward
TCGCACTCAACGCCCTGCATATGACAAGAC
+
A64;##=#B9=AAAAAAAAAA9#:AB95%^

]

mysample_R1.fastq ]

.pull-right[ .red[

@PAIR-1 reverse
AAGTTACCCTTAACAACTTAAGGGTTTTCA
+
fffddf`feedB`IABa)^%YBBBRTT\^d

] .orange[

@PAIR-2 reverse
AGCAGAAGTCGATGATAATACGCGTCGTTT
+
IIIIIII^^IIId`?III%IIIGII>IIII

] .blue[

@PAIR-3 reverse
AATCCATTTGTTCAACTCACAGTTTACCGT
+
9C;=;=<9@4868>9:67AA<9>65<=>59

] mysample_R2.fastq ]

Speaker Notes


class: top

Paired-end FASTQ files

.pull-left[ .red[

@PAIR-1 forward
GGGTGATGGCCGCTGCCGATGGCGTCAAAT
+
))%255CCF>>>>>>CCCCCCC65`IIII%

] .left[] .orange[

@PAIR-2 forward
GATTTGGGGTTCAAAGCAGTATCGATCAA
+
!''3((((^^d+))%%%++)(%%%%).1)

] .blue[

@PAIR-3 forward
TCGCACTCAACGCCCTGCATATGACAAGAC
+
A64;##=#B9=AAAAAAAAAA9#:AB95%^

]

mysample_R1.fastq ] .pull-right[ .red[

@PAIR-1 reverse
AAGTTACCCTTAACAACTTAAGGGTTTTCA
+
fffddf`feedB`IABa)^%YBBBRTT\^d

] .orange[

@PAIR-2 reverse
AGCAGAAGTCGATGATAATACGCGTCGTTT
+
IIIIIII^^IIId`?III%IIIGII>IIII

] .blue[

@PAIR-3 reverse
AATCCATTTGTTCAACTCACAGTTTACCGT
+
9C;=;=<9@4868>9:67AA<9>65<=>59

] mysample_R2.fastq ]

Speaker Notes

Paired-end FASTQ files

.pull-left[ .red[

@PAIR-1 forward
GGGTGATGGCCGCTGCCGATGGCGTCAAAT
+
))%255CCF>>>>>>CCCCCCC65`IIII%

] .blue[

@PAIR-3 forward
TCGCACTCAACGCCCTGCATATGACAAGAC
+
A64;##=#B9=AAAAAAAAAA9#:AB95%^

] .green[

@PAIR-4 forward
AAACTTCGTAGGTCCATTTGACAGCGTGCA
+
6664%!!III^(=%3333^^d^d:#32333

] mysample_R1.fastq ] .pull-right[ .red[

@PAIR-1 reverse
AAGTTACCCTTAACAACTTAAGGGTTTTCA
+
fffddf`feedB`IABa)^%YBBBRTT\^d

] .orange[

@PAIR-2 reverse
AGCAGAAGTCGATGATAATACGCGTCGTTT
+
IIIIIII^^IIId`?III%IIIGII>IIII

] .blue[

@PAIR-3 reverse
AATCCATTTGTTCAACTCACAGTTTACCGT
+
9C;=;=<9@4868>9:67AA<9>65<=>59

] mysample_R2.fastq ]

Speaker Notes By cutting the orange read (PAIR-2) only from the forward reads file, but leaving its mate in the reverse file, an incorrect pairing is now assumed by downstream tools.
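A small check can make this kind of desynchronisation visible. The sketch below compares read names position by position, using the headers from the slide above. Taking the pair identifier as the text before the first whitespace is an assumption that fits these example names; real FASTQ naming conventions vary, and the function names are hypothetical.

```python
def pair_id(header):
    """Pair identifier: the read name up to the first whitespace (assumed)."""
    return header.lstrip("@").split()[0]


def find_mispaired(forward_headers, reverse_headers):
    """Return (index, forward id, reverse id) for every position that disagrees."""
    problems = []
    for index, (fwd, rev) in enumerate(zip(forward_headers, reverse_headers)):
        if pair_id(fwd) != pair_id(rev):
            problems.append((index, pair_id(fwd), pair_id(rev)))
    return problems


# Headers in file order after PAIR-2 was cut from the forward file only.
forward_headers = ["@PAIR-1 forward", "@PAIR-3 forward", "@PAIR-4 forward"]
reverse_headers = ["@PAIR-1 reverse", "@PAIR-2 reverse", "@PAIR-3 reverse"]

print(find_mispaired(forward_headers, reverse_headers))
# [(1, 'PAIR-3', 'PAIR-2'), (2, 'PAIR-4', 'PAIR-3')]
```

Every read after the removed one is shifted by one position, so a single unpaired deletion corrupts all following pairs.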


Choosing an Aligner

.center[ .image-40[Mapping RNA] ] .footnote[Figure: mapping of RNA-seq reads differs from mapping of DNA-seq reads]

Speaker Notes Choice of mapper depends on your experiment

Or other factors

FAQ: “Why not map RNA reads to the transcriptome?”

FAQ: “Why not BLAST or BLAT?”


Know your data!

“… there is no tool that outperforms all of the others in all the tests. Therefore, the end user should clearly specify [their] needs in order to choose the tool that provides the best results.” - Hatem et al BMC Bioinformatics 2013, 14:184

.footnote[ DOI: 10.1186/1471-2105-14-184 ]

Speaker Notes

Know the data you are working with and pick the right mapper and parameters for the job!

This is not an easy task.


class: top

Mapping tools

Timeline of mapping tools

.footnote[60+ different mappers, many comparison papers. Figure from 10.1093/bioinformatics/bts605 ]

Speaker Notes

Many different tools available

They have different strengths and weaknesses; a comparison table is available at the link.


class: top

Mapping tools

| Mapping tool | Uses | Characteristics |
|---|---|---|
| HISAT2 | DNA/RNA | Short reads. Based on GCSA. Reference. |
| RNASTAR | RNA | Short reads. Extremely fast. High sensitivity and accuracy. Based on Maximal Mappable Prefixes (MMPs). Reference. |
| BWA-MEM2 | DNA | Short reads. Twice as fast as BWA-MEM. Memory efficient. Based on Burrows-Wheeler. Reference. |
| Minimap2 | DNA/RNA | Long reads (PacBio and ONT). Extremely fast. Based on DALIGN and MHAP. Reference. |
| Bismark | DNA/RNA | Short reads. Bisulfite-treated sequencing. Based on GCSA. Reference. |
| BBMap | DNA/RNA | Short and long reads (PacBio and ONT). Memory demanding. Reference. |
| Whisper 2 | DNA | Short reads. Indel sensitive. Variant-calling oriented. Reference. |
| S-conLSH | DNA | Long reads (ONT). High sensitivity and accuracy. Reference. |

File Formats


SAM/BAM file format

Example of SAM file format

SAM: Sequence Alignment Map

BAM: Binary (compressed) SAM; not human-readable


SAM/BAM file format

More detailed view of SAM format

Speaker Notes The alignment is given in the CIGAR string.
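As a short sketch of what a CIGAR string encodes, the Python below splits one into (length, operation) pairs and derives how many read bases and reference bases the alignment covers. The operation letters and which of them consume the read or the reference follow the SAM specification; the function names are made up here, and the example string `3M1I1M2D6M` is a hand-written encoding of the first alignment of the Sequence Alignment example (AAATCTCTGAA against AAACAGTGAGAA), not output from a real mapper.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


cigar = "3M1I1M2D6M"
print(parse_cigar(cigar))       # [(3, 'M'), (1, 'I'), (1, 'M'), (2, 'D'), (6, 'M')]
print(query_length(cigar))      # 11
print(reference_length(cigar))  # 12
```

Note that `M` covers both matches and mismatches, so the two mismatched bases near the end of that alignment are part of the final `6M`.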


class: top

Genome Browsers

IGV Genome Browser

.footnote[This is IGV (Integrative Genomics Viewer) DOI: 10.1038/nbt.1754]

Speaker Notes


class: top

Genome Browsers in Galaxy

.image-90[Screenshot of JBrowse in the Galaxy Interface showing transcripts, various box plots, heatmaps, sequencing depth and variation plots.]

.footnote[JBrowse.org DOI: 10.1186/s13059-016-0924-1]

Speaker Notes

The JBrowse tool builds a small website for you and pre-processes the reference genome into a more efficient format. If you want to share this with your colleagues, you can download the dataset and place it directly on your web server.


class: top

Genome Browsers in Galaxy

Speaker Notes In the mapping hands-on tutorial you will use JBrowse and IGV


Key Points

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors!

This material is licensed under the Creative Commons Attribution 4.0 International License.