
Summary ALL Bioinformatics LESSONS + notes + commands

Pages
49
Uploaded on
22-03-2025
Written in
2024/2025

This document contains all the lessons detailed, from what was said in the lesson to all the commands used. It's copy paste for the exam!


Evolution: phylogenetics



Intro: typical workflow




Intro: example research questions
• When did an epidemic start?
• Where does a virus originate from?
• What factors explain epidemic growth?
• Heterogeneity in transmission (superspreaders)?
• Are hosts X, Y & Z epidemiologically linked?
• Of how many strains is the epidemic composed?
• Are strains associated with particular transmission routes?
• What adaptive changes have been accrued?
….

The blue-shaded graph illustrates how genome data has been complemented with
additional genomic datasets to track changes in population size over time in North America.

During the second phase of growth, there is a notable increase, which corresponds to the rising prevalence of injecting
drug use.

Molecular evolution
Analysing evolution with respect to time.
• Genetic information is stored in double-stranded DNA (most organisms) or RNA (less
common, e.g. some viruses such as HIV, influenza, SARS-CoV-2, …)
• This genetic information can change over time

Sequences that share more similar genetic material are more closely related to each other
than those from a different population or species.

Evolution is analyzed by looking backward in time. A set of genetic samples is compared, and by examining the
number and patterns of differences, scientists can infer evolutionary relationships and historical changes in
populations.
Replication inevitably results in genetic change → VERTICAL evolution
• point mutations
• insertions and deletions (indels)

"The process of evolution, driven by genetic variation, is well captured through structural changes in the
genome."

Regarding vertical evolution:
"Vertical evolution refers to the transmission of genetic material from parent to offspring over generations, leading to
gradual changes in a species through mutation, recombination, and natural selection."

As we progress through the slides and move closer to the present, you can see how the process of evolution through
common descent with variation is beautifully represented in this branching tree structure. This model can then be used to
reconstruct evolutionary history from the past.
When we start with a set of samples or sequences, it should also be visually clear that presenting a phylogeny in this way
reflects the concept of vertical evolution.
The main mechanisms by which evolution generates genetic diversity are point mutations, insertions, and deletions.

A point mutation occurs when DNA or RNA polymerase incorporates an incorrect complementary nucleotide. For
example, if a base pair is mismatched (such as A pairing with C instead of A with T), a point mutation arises. Over time, this
process can lead to significant genetic variation.

Another way genetic diversity arises is through insertions and deletions. This happens when polymerases fail to
incorporate the exact number of nucleotides. Instead of adding just one nucleotide, two, three, or more may be inserted,
resulting in an insertion in the daughter strand relative to the parent strand. Conversely, if the polymerase skips a few
nucleotides during replication, a deletion occurs in the daughter strand compared to the parent strand.
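
As a minimal sketch of how these changes show up when a daughter strand is compared with its parent, the snippet below counts point mutations, insertions and deletions in a pairwise alignment. The two sequences and the function name `count_changes` are made up for illustration, and gaps are assumed to be written as `-`.

```python
def count_changes(parent: str, daughter: str) -> dict:
    """Count matches, point mutations, insertions and deletions.

    Both strings must already be aligned (equal length); '-' marks a gap.
    """
    if len(parent) != len(daughter):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "point_mutation": 0, "insertion": 0, "deletion": 0}
    for p, d in zip(parent, daughter):
        if p == "-" and d == "-":
            continue  # column carries no information for this pair
        if p == "-":
            counts["insertion"] += 1       # extra nucleotide in the daughter
        elif d == "-":
            counts["deletion"] += 1        # nucleotide skipped in the daughter
        elif p == d:
            counts["match"] += 1
        else:
            counts["point_mutation"] += 1  # different nucleotide at this site
    return counts


# Hypothetical aligned parent and daughter strands
parent = "ACGT--ACGTAC"
daughter = "ACCTGGAC--AC"
changes = count_changes(parent, daughter)
print(changes)
```

Here the insertion and the deletion each span two adjacent positions, so they are counted per position rather than per event.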

Data set compilation
• a high quality data set is a prerequisite for being able to draw meaningful conclusions
• what a good data set is primarily depends on the question at hand
• usually: own sampling combined with publicly available sequence data
• sequencing is getting cheaper and cheaper
• data sharing incentives:
1. most journals have strict data sharing policies
2. most scientists
3. all pathogen genetic sequencing data during outbreaks or other public
health emergencies must be made publicly available → WHO Code of Conduct

Examples of such databases include:
• NCBI GenBank (fasta) https://www.ncbi.nlm.nih.gov/genbank/
• NCBI SRA (fastq or bam) https://www.ncbi.nlm.nih.gov/sra
• NCBI Virus: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/
• NCBI Microbes: https://www.ncbi.nlm.nih.gov/genome/microbes/
• European Nucleotide Archive: https://www.ebi.ac.uk/ena/browser/home
• EnsemblBacteria: https://bacteria.ensembl.org/index.html
• DNA DataBank of Japan: https://www.ddbj.nig.ac.jp/index-e.html
• Other specialty databases:
• GISAID (viruses): https://www.gisaid.org/
• BacWGSTdb (bacteria): http://bacdb.cn/BacWGSTdb/
• PlasmoDB (plasmodium): https://plasmodb.org/plasmo/app/

Metadata matter
Minimum information needed:
- date of sample collection and/or onset of patient symptoms (yyyy-mm-dd): the year alone is not sufficient; precision matters
- location of sample collection and/or location of exposure: country level minimum; higher resolution when possible (ethical considerations)
- host and sample type used to generate the sequence, e.g. human serum, tissue, environmental, vector, etc.
Additional information
- patient: symptoms, outcomes, treatment/vaccination, PCR test results
- passage history
- sample collection methods
- sequencing & data processing methods (access to raw data)
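
To make the point about date precision concrete, here is a small sketch that reports how precise a collection date is. The function name `date_precision` and the sample records are hypothetical; it only assumes dates are written as yyyy-mm-dd, yyyy-mm or yyyy.

```python
from datetime import datetime


def date_precision(value: str) -> str:
    """Return 'day', 'month' or 'year' depending on how complete the date is."""
    formats = (("%Y-%m-%d", "day"), ("%Y-%m", "month"), ("%Y", "year"))
    for fmt, precision in formats:
        try:
            datetime.strptime(value, fmt)
            return precision
        except ValueError:
            continue  # try the next, less precise format
    raise ValueError(f"unrecognised date: {value!r}")


# Hypothetical metadata records
records = [
    {"sample": "sample_A", "collection_date": "2017-03-14"},
    {"sample": "sample_B", "collection_date": "2017-03"},
    {"sample": "sample_C", "collection_date": "2017"},
]

# Flag samples whose collection date is less precise than a full day
imprecise = [r["sample"] for r in records
             if date_precision(r["collection_date"]) != "day"]
print(imprecise)
```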




Left: without much precision.

• What are we looking for? Homologous sequences (sequences are homologous when they share
a common ancestor)
• How? Look for excess similarity compared to expectation under random process

→ Goal: maximise similarity. The higher the similarity, the higher the alignment score.
To compare the genetic material we need positional homology: sequences are arranged such that homologous
nucleotides form columns
How to reach positional homology in practice?
- give a reward for a match
- give a penalty for a mismatch
score(i,j) = λ * log( qij / (pi * pj) )
- hypothesis 1 (the evolutionary model): qij: probability that nucleotides i and j are
aligned because they're homologous
- hypothesis 2 (the random model): pi * pj: probability that nucleotides i and j are
aligned in a random alignment
- λ: scaling factor

This shows that, with a minimal set of assumptions, we can easily devise our
own scoring scheme that is optimal for a given setting, in this case
alignments with 88% identity.
Note: the teachers will not ask you to devise your own scoring scheme.
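
As an illustration only, the log-odds score can be worked out for the 88% identity case under a few assumptions that are not taken from the slides: equal base frequencies (pi = 0.25), the 88% matches spread evenly over the 4 identical pairs, the 12% mismatches spread evenly over the 12 different pairs, natural logarithms and λ = 1.

```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, lam: float = 1.0) -> float:
    """score(i,j) = lam * log(q_ij / (p_i * p_j))."""
    return lam * math.log(q_ij / (p_i * p_j))


identity = 0.88                   # target fraction of identical columns
p = 0.25                          # assumed: all four bases equally frequent

q_match = identity / 4            # assumed: 4 matching pairs share the 88%
q_mismatch = (1 - identity) / 12  # assumed: 12 mismatching pairs share the 12%

match_score = log_odds_score(q_match, p, p)
mismatch_score = log_odds_score(q_mismatch, p, p)
print(f"match: {match_score:.2f}, mismatch: {mismatch_score:.2f}")
```

A match is more likely under the evolutionary model than under the random model, so it gets a positive score (a reward); a mismatch is less likely, so it gets a negative score (a penalty).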

These matches or mismatches also contribute to genetic diversity,
which is further generated through insertions and deletions. In a
sequence alignment, these appear as gaps. However, unlike matches
and mismatches, which have a straightforward analytical solution,
multiple analytical methods exist for handling gaps.

Fortunately, we can also approach this biologically. When examining
the coding regions of the genome, we know that evolution more often
occurs through the addition or deletion of nucleotides in groups of
three rather than as single nucleotides. This is because the genetic
coding system is based on codons, which consist of three nucleotides. Changes in multiples of three disrupt the protein
code less than single insertions or deletions, which can cause frameshift mutations.

This principle can be incorporated into a scoring model for alignments, as shown here. For example, if you assign a fixed
penalty of -5 to each gap (-), the total score remains the same regardless of how the gaps are placed in the alignment.
However, by using a gap opening penalty and a gap extension penalty, you can assign values that better reflect biological
reality, resulting in a more accurate scoring model.
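
The difference between the two gap models can be sketched as follows. The penalty values (-5 per gap, -10 to open, -1 to extend) and the example strings are illustrative assumptions, and the convention used here is that the first position of a gap costs the opening penalty and every further position costs the extension penalty; other tools define this slightly differently.

```python
import re


def linear_gap_score(aligned: str, gap_penalty: int = -5) -> int:
    """Fixed penalty per gap character, regardless of where the gaps are."""
    return aligned.count("-") * gap_penalty


def affine_gap_score(aligned: str, gap_open: int = -10, gap_extend: int = -1) -> int:
    """Opening penalty for the first position of each gap run, extension after."""
    score = 0
    for run in re.findall(r"-+", aligned):
        score += gap_open + (len(run) - 1) * gap_extend
    return score


one_codon_gap = "ACG---TTA"   # three gap positions in one run (a whole codon)
scattered_gaps = "AC-G-T-TA"  # three gap positions in three separate runs

for name, seq in (("one codon", one_codon_gap), ("scattered", scattered_gaps)):
    print(name, linear_gap_score(seq), affine_gap_score(seq))
```

With the fixed penalty both placements get the same gap score, whereas the opening/extension scheme favours the single codon-sized gap over three separate single-nucleotide gaps.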

Database querying in practice: BLAST
Software designed to look up similar sequences.

Data set compilation




Alignment editing
- always visually inspect an alignment/output
- for coding sequences, the amino acid alignment can guide the nucleotide alignment (particularly useful for divergent
sequences)
- sections that cannot be unambiguously aligned should be removed (avoid non-homology)
- editing can be done manually (always somewhat subjective) or in an automated fashion (more objective)
→ This process can be quite lengthy and time-consuming, especially when dealing with a large number of sequences that
need to be arranged in a matrix. It becomes even more tedious when a new set of sequences becomes available, requiring
you to start over from scratch and wait for the multiple sequence alignment (MSA) to be completed again.

An alternative approach is to create an MSA for the newly added data and
then align this new MSA with the existing one. This is done by generating
profiles for both alignments and comparing them.

There are various methods for performing profile-based multiple sequence
alignments.

On the right side of the slide, there is an example where an alignment with
hundreds of sequences is summarized in just four lines by representing the
percentage of each nucleotide at each position. This significantly reduces
computational requirements and speeds up the process of incorporating new data into an
existing alignment.
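
A profile of this kind can be sketched in a few lines. The toy alignment and the function name `nucleotide_profile` are hypothetical, and this is only one simple way to build a profile; gap characters count towards the column total but get no row of their own.

```python
from collections import Counter


def nucleotide_profile(alignment, alphabet="ACGT"):
    """Return, per nucleotide, the percentage found at each alignment column."""
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = {base: [] for base in alphabet}
    for column in zip(*alignment):       # iterate over alignment columns
        counts = Counter(column)
        for base in alphabet:
            profile[base].append(100.0 * counts[base] / len(column))
    return profile


# Hypothetical toy alignment; a real one could hold hundreds of sequences
alignment = ["ACGT", "ACGA", "ATGT", "ACGT"]
profile = nucleotide_profile(alignment)
for base, percentages in profile.items():
    print(base, percentages)
```

However many sequences go in, the result is four lines (one per nucleotide), which is what makes comparing two profiles cheaper than realigning every sequence.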

Automated alignment editing
Depending on the score of a column in the alignment (e.g. gap or similarity score), the column is kept or removed.
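
A minimal sketch of such a column filter, using the gap fraction as the column score. The 50% threshold, the toy alignment and the function name `drop_gappy_columns` are assumptions for illustration; dedicated trimming tools use more refined criteria.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only the columns whose fraction of gaps is at most the threshold."""
    keep = [
        i for i, column in enumerate(zip(*alignment))
        if column.count("-") / len(column) <= max_gap_fraction
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Hypothetical toy alignment: columns 3 and 6 are mostly gaps
alignment = ["AC-GT-", "AC-GTA", "A--GT-", "ACTGT-"]
trimmed = drop_gappy_columns(alignment)
print(trimmed)
```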
Sequence file formats
- FASTA: Simple format for nucleotide/protein sequences with a header line
starting with ">".
- NEXUS: Complex format for phylogenetic analysis, includes sequence
data and metadata.
- PHYLIP: Format for sequence alignments used in phylogenetic analysis.
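
To show how simple the FASTA format is, here is a minimal hand-written parser. It is a sketch for illustration with made-up records; in the exercise itself the sequence files are handled with Biopython (the `SeqIO` and `AlignIO` modules imported in the notebook) rather than with code like this.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip empty lines
        if line.startswith(">"):
            header = line[1:]             # header line, without the '>'
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            records[header].append(line)  # a sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical FASTA content
example = """>seq1 hypothetical sample A
ACGTACGT
ACGT
>seq2 hypothetical sample B
ACGTTCGT
"""
sequences = parse_fasta(example)
print(sequences)
```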

Phylogenies




• neighbor-joining is based on the minimum evolution criterion i.e. the topology that gives the least total branch length is
preferred at each step of the algorithm.
• however, the heuristic is not guaranteed to find the tree topology with the least total branch length
• even though it is sub-optimal in this sense, it has been extensively tested and usually finds a tree that is quite close to the
optimal tree
• very fast
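
The first joining step of neighbor-joining can be sketched as below. This is not the full algorithm (which repeats the step on an updated distance matrix until the tree is resolved) and not the `ape` implementation used later; the distance matrix is a made-up five-taxon example and `nj_first_join` is a hypothetical name.

```python
def nj_first_join(dist):
    """Pick the first pair of taxa to join and the branch lengths to the new node.

    Uses Q(i,j) = (n-2)*d(i,j) - sum_k d(i,k) - sum_k d(j,k); the pair with the
    lowest Q is joined. Needs at least three taxa.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    i, j = best_pair
    # Branch lengths from the joined taxa to their new internal node
    len_i = dist[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return best_pair, best_q, (len_i, len_j)


# Hypothetical pairwise distances between taxa a, b, c, d, e
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q_value, branch_lengths = nj_first_join(distances)
print(pair, q_value, branch_lengths)
```

Note that the pair with the smallest raw distance (d and e, distance 3) is not the one joined first: the Q criterion also accounts for how far each taxon is from all the others.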
LESSON 1 2025: NJ, quickly building a phylogenetic tree
Database querying with BLAST, multiple sequence alignment and phylogenetic reconstruction
After this exercise you will be able to
• query the GenBank database via the web interface and through the command line
• look up sequences similar to a query in a database with BLAST via the command line
• create and edit a multiple sequence alignment
• reconstruct a phylogeny with the NJ algorithm
For this exercise we revisit an analysis of a Yellow Fever Virus (YFV) genome generated from a Dutch patient with a travel
history from Suriname. When this patient got infected in 2017, it represented the first YFV case notified from Suriname in
decades. Around that time, there also was an outbreak of yellow fever in Brazil. The Dutch patient's YFV genome was used
to identify a possible origin for the virus and to evaluate whether the reported YFV was introduced from the ongoing 2016-
2017 Brazilian outbreak. The original report can be found in this post on virological.org.
The workflow is as follows:
1. look up the Dutch patient's YFV sequence
2. fetch a set of highly similar sequences
3. create and edit the multiple sequence alignment
4. reconstruct a NJ tree

Before diving into the data analysis, set up the environment. We begin by loading the Python modules to be used.
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import Entrez, SeqIO, Phylo, AlignIO
import pandas as pd
import time, os, alv, re
from IPython.display import Image
Load the interface to run R embedded in a Python process:
%load_ext rpy2.ipython
[R] Load R packages to be used:
If you work with %%R, then you're no longer working in Python for that cell.
%%R
library("ape")
Define global variables for the input names that will only have to be changed once.
baseName = "Dutch_patient_YFV"
folderName = "exercise_YFV"
Push these variables to R using the line magic %Rpush
%Rpush baseName folderName
Let's check whether this actually works:
%%R
print(paste0("baseName: ", baseName, " -- folderName: ", folderName))
[1] "baseName: Dutch_patient_YFV -- folderName: exercise_YFV"
Variables can be loaded into bash using the option -s

NOTE
The variables are passed as if they were input arguments of a script. This means that they are no longer named but
indexed. Hence, to refer to such variables, use $1 instead of $baseName and $2 instead of $folderName

[bash] Let's also check this for bash:
%%bash -s "$baseName" "$folderName"
echo baseName: "$1" -- folderName: "$2"
baseName: Dutch_patient_YFV -- folderName: exercise_YFV
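
Outside a notebook, the same positional-argument behaviour can be reproduced from plain Python. This is a sketch that roughly mimics what the `%%bash -s` cell does (an assumption about the mechanism, not the magic's actual implementation) and it requires `bash` to be installed: `bash -s` reads the script from standard input and exposes the remaining arguments as `$1`, `$2`, ...

```python
import subprocess

baseName = "Dutch_patient_YFV"
folderName = "exercise_YFV"

# The script only sees positional parameters, not the Python variable names
script = 'echo baseName: "$1" -- folderName: "$2"\n'

result = subprocess.run(
    ["bash", "-s", baseName, folderName],  # arguments after -s become $1, $2
    input=script,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```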

Create a folder where analysis files can be saved:
%%bash -s "$folderName"
mkdir "$1" # fails if the folder already exists; mkdir -p "$1" succeeds silently in that case (existing files are kept, not overwritten)

Lorejansens123 Odisee Hogeschool