Learning Objectives: Genomics
HC1: Intro, BLAST
Why study bioinformatics?
Explain why a biologist should know Bioinformatic Data Analysis
Describe the ‘omics: (meta-) genomics, (meta-) transcriptomics, (meta-) proteomics,
metabolomics, etc.
Genomics: Sequence all of the DNA of one organism
Transcriptomics: Sequence all of the mRNA in an organism/tissue/cell
Proteomics: Sequence all of the proteins in an organism/tissue/cell
Metagenomics: Sequence the DNA of all organisms in a sample
Metatranscriptomics: Sequence the mRNA of all organisms in a sample
Metaproteomics: Sequence the proteins of all organisms in a sample
Explain the biology behind the ‘omics revolution: reduce bias by measuring all of a thing
Omics solves a major problem in science: biases
- People are mostly interested in: 1. Their diseases 2. Their food 3. Themselves
- This causes biases in our general understanding of biology, and biases in our databases
- For example, most studied bacteria are associated with humans
Compare the two ways a bioinformatician exploits existing data to make new discoveries
(top-down and bottom-up)
Sequence similarity searches
Explain what a sequence alignment is and the difference between a global and local
sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or
protein to identify regions of similarity that may be a consequence of functional, structural,
or evolutionary relationships between the sequences. Aligned sequences of nucleotide or
amino acid residues are typically represented as rows within a matrix. Gaps are inserted
between the residues so that identical or similar characters are aligned in successive
columns.
Local alignment: finds the optimal sub-alignment within two sequences. Suited to partial
homologs.
Global alignment: aligns two sequences from end to end. Suited to cases where you know the
two sequences are full homologs, e.g. resulting from gene duplication.
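The difference between the two alignment types can be illustrated with a small dynamic-programming sketch. It is not taken from the lecture: the scoring values (match = 1, mismatch = -1, gap = -2) and the function name align_score are assumptions chosen for illustration. The global variant follows the Needleman-Wunsch scheme, the local variant the Smith-Waterman scheme.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring sub-alignment only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: gaps at the start of either sequence are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local: a negative running score is reset to zero,
                # so an alignment can start anywhere.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two sequences that share only the short region "ACG".
print(align_score("TTTACGTTT", "GGGACGGGG", local=True))   # local score
print(align_score("TTTACGTTT", "GGGACGGGG", local=False))  # global score
```

For these two toy sequences the local score rewards only the shared region, while the global score is dragged down by the mismatching ends, which is why local alignment suits partial homologs.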
Explain the BLAST algorithm
1. Identifies all words (length W) in the query
- Default lengths: W = 3 for protein, W = 11 for DNA
- Based on substitution scores
2. Quickly finds similar words in the database
- “Similar” words are defined by using the substitution matrix (e.g. BLOSUM62)
- The index quickly locates all potential hit sequences
3. Extends seeds in both directions to find HSPs (high-scoring segment pairs) between query
and hit
- HSP: region that can be aligned with a score above a certain threshold
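The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not the real BLAST implementation: it uses exact word matches only (no substitution matrix or neighbourhood words), ungapped extension, and assumed toy parameters (w = 4, match = 1, mismatch = -1, drop = 3, threshold = 6). All function names are hypothetical.

```python
from collections import defaultdict


def build_index(subject, w):
    """Step 2 helper: map every word of length w to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - w + 1):
        index[subject[i:i + w]].append(i)
    return index


def extend(query, subject, q, s, step, match, mismatch, drop):
    """Step 3 helper: extend without gaps from (q, s) in direction step.

    Returns (best_gain, residues_kept). Extension stops once the running
    score falls more than `drop` below the best score seen so far.
    """
    best = score = kept = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, kept = score, length
        elif best - score > drop:
            break
        q += step
        s += step
    return best, kept


def find_hsps(query, subject, w=4, threshold=6, match=1, mismatch=-1, drop=3):
    """Return sorted (score, query_start, query_end, subject_start) tuples."""
    index = build_index(subject, w)
    hsps = set()
    # Step 1: enumerate all words of length w in the query.
    for qi in range(len(query) - w + 1):
        # Step 2: look up exact occurrences of the word in the subject.
        for si in index.get(query[qi:qi + w], []):
            right, r_len = extend(query, subject, qi + w, si + w, 1, match, mismatch, drop)
            left, l_len = extend(query, subject, qi - 1, si - 1, -1, match, mismatch, drop)
            score = w * match + left + right
            if score >= threshold:
                hsps.add((score, qi - l_len, qi + w + r_len, si - l_len))
    return sorted(hsps, reverse=True)


print(find_hsps("TTACGTACGGA", "CCACGTACGCC"))
```

Several overlapping seeds extend to the same region here, so collecting the results in a set reports that HSP only once.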
List the factors including heuristics that make BLAST fast
The fastest algorithms generally use heuristics.
Heuristic: a practical method that is not guaranteed to be optimal, but is sufficient for the
present goals.
Running BLAST
Evaluate BLAST output/results
Decide which BLAST flavor to use for your similarity search
BLAST flavors: direct searches
o Nucleotide-nucleotide searches
- Nucleotide database & nucleotide query
- blastn (default: W = 11 nucleotides)
Find homologous genes in different species
- Megablast (default: W = 28 nucleotides)
Designed to efficiently find longer alignments between very similar
nucleotide sequences
Best tool to find highly identical hits for a query sequence
For example: find sequences from the same species
- Discontiguous Megablast
Uses discontiguous words (e.g. W = 11 nucleotides: AT-GT-AC-CG-CG-T)
For example, this can focus the search on codons (the third nucleotide
of codons is less conserved due to the degeneracy of the genetic code)
Best tool to find nucleotide-nucleotide hits at larger evolutionary
distances for protein-coding query sequences.
o Protein-protein searches
- Protein database & protein query sequences
- blastp (default: W = 3 amino acids)
Find homologous proteins in different species
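The discontiguous words used by Discontiguous Megablast can be illustrated with a small sketch. The template below simply encodes the example from these notes (AT-GT-AC-CG-CG-T: 11 sampled positions, every third position skipped); the templates actually used by BLAST may differ, and the names TEMPLATE and spaced_word are hypothetical.

```python
# "1" = position is part of the word, "0" = position is ignored.
# Assumed template mirroring the example AT-GT-AC-CG-CG-T from the notes.
TEMPLATE = "1101101101101101"


def spaced_word(seq, start=0, template=TEMPLATE):
    """Return the discontiguous word that starts at `start` in `seq`."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("sequence too short for the template")
    return "".join(base for base, keep in zip(window, template) if keep == "1")


# Two sequences that differ only at the skipped (third codon) positions
# yield the same word, so they can still seed a hit.
print(spaced_word("ATAGTCACGCGACGTT"))
print(spaced_word("ATGGTTACACGTCGGT"))
```

Because the skipped positions line up with the third nucleotide of each codon, substitutions there do not prevent a seed match.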
BLAST flavors: translated searches
o We can exploit the conservation of protein sequences when aligning DNA sequences, by
using translated searches
o This allows for more sensitive searches that detect homology at greater evolutionary
distances
– For example: homologous genes in distantly related species
o blastx and tblastx first translate the query from nucleotide into protein before identifying
high-scoring words
o tblastn and tblastx use a translated database of nucleotide sequences stored as proteins
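The translation step behind the translated searches can be sketched as a six-frame translation: three reading frames on the forward strand and three on the reverse complement. This is a minimal illustration that assumes the standard genetic code; the function names are hypothetical and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six protein strings can then be compared at the protein level, which is what makes translated searches more sensitive than a direct nucleotide comparison.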
HC2: Quantifying Sequence Similarity
Evolution
List the mechanisms of DNA mutation
Nucleotide substitutions
- Replication error
- Physical or chemical reaction
Insertions or deletions (indels)
- Unequal crossing over during meiosis
- Replication slippage
Inversions or rearrangements
Duplications of:
- Partial or whole gene
- Partial (polysomy) or whole chromosome (aneuploidy, polysomy)
- Whole genome (polyploidy)
Horizontal gene transfer (HGT)
- Transfer of genetic material between individuals of the same generation (not from parent
to offspring)
Define homology, similarity, and identity
Homology
- Property of two sequences that have a shared ancestor
- Homology is TRUE or FALSE: either you’re family or you’re not
Identity
- Percentage of identical residues in an alignment
- Used for amino acids or nucleotides.
Similarity
- Percentage of amino acid residues in an alignment with a positive substitution score
- Not used for DNA
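Identity and similarity as defined above can be computed directly from two aligned rows. The sketch below makes two assumptions that are not from the notes: percentages are taken over the full alignment length including gap columns, and POSITIVE_PAIRS is only a small hand-picked subset of the non-identical amino acid pairs that score positively in BLOSUM62, not the full matrix.

```python
# Hand-picked subset of non-identical pairs with a positive BLOSUM62 score.
# A real analysis would look up every pair in the full matrix.
POSITIVE_PAIRS = {frozenset(p) for p in ("IV", "IL", "LM", "KR", "DE", "FY", "ST")}


def identity_and_similarity(row_a, row_b):
    """Return (percent identity, percent similarity) for two aligned rows.

    Rows must have equal length; '-' marks a gap. Identical residues
    count towards both identity and similarity.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = similar = 0
    for x, y in zip(row_a, row_b):
        if "-" in (x, y):
            continue  # gap columns are neither identical nor similar
        if x == y:
            identical += 1
            similar += 1
        elif frozenset((x, y)) in POSITIVE_PAIRS:
            similar += 1
    columns = len(row_a)
    return 100 * identical / columns, 100 * similar / columns


print(identity_and_similarity("MKV-LDSG", "MRIALETG"))
```

Similarity is always at least as high as identity, because every identical pair also has a positive substitution score.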
List four properties of amino acids that might be important in determining their
physicochemical similarity
Size, polarity, hydrophobicity, preferred protein fold
Probability & Permutation Statistics
Work with P-values obtained using permutation statistics
P-value: defined as the probability of observing a hit as good as, or better than your score by
chance.
In permutation statistics, this corresponds to the fraction of permutations in which the
permuted score is equal to or higher than your score.
A meaningful observation has a low P-value: randomly permuted data rarely has an equal or
higher score.
The minimum P-value depends on the number of random permutations.
Example: for 100 permutations, the best P-value: <0.01
For 1000 permutations, the best P-value: <0.001
Explain how permutation statistics help us evaluate the strength of a result
Statistics are not well defined for many bioinformatic analyses. A simple solution is data
permutation:
- Permute (shuffle) the sequences 1000* times
- Make 1000* new alignment matrices
- Register whether the alignment score of the permuted sequences is equal to or higher than
your score
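The permutation procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: only the second sequence is shuffled, the random seed is fixed for reproducibility, and identical_positions is a hypothetical stand-in for a real alignment score.

```python
import random


def identical_positions(a, b):
    """Toy score: number of positions with the same character (no gaps)."""
    return sum(x == y for x, y in zip(a, b))


def permutation_p_value(seq_a, seq_b, score_fn, n_permutations=1000, seed=0):
    """Fraction of permutations scoring equal to or higher than the real score."""
    rng = random.Random(seed)
    observed = score_fn(seq_a, seq_b)
    letters = list(seq_b)
    as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(letters)  # permute (shuffle) the sequence
        if score_fn(seq_a, "".join(letters)) >= observed:
            as_good += 1  # register a permuted score >= your score
    return as_good / n_permutations


p = permutation_p_value("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGT",
                        identical_positions, n_permutations=1000)
print(p)
```

If no permutation reaches your score, the function returns 0.0; in line with the notes above, that should be reported as P < 1/n_permutations (e.g. < 0.001 for 1000 permutations) rather than as exactly zero.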