Summary of ALL LESSONS Bioinformatics + notes + commands

Pages
49
Uploaded on
22-03-2025
Written in
2024/2025

This document contains all lessons worked out, from what was said in class to all the commands used. The exam comes down to copying and pasting!

Preview of the content

Evolution: phylogenetics



Intro: typical workflow




Intro: example research questions
• When did an epidemic start?
• Where does a virus originate from?
• What factors explain epidemic growth?
• Heterogeneity in transmission (superspreaders)?
• Are hosts X, Y & Z epidemiologically linked?
• Of how many strains is the epidemic composed?
• Are strains associated with particular transmission routes?
• What adaptive changes have been accrued?
….

The blue-shaded graph illustrates how genome data has been complemented with
additional genomic datasets to track changes in population size over time in North America.

During the second phase of growth, there is a notable increase, which corresponds to the rising prevalence of injecting
drug use.

Molecular evolution
Analysing evolution with respect to time.
• Genetic information is stored in double stranded DNA (most organisms) or RNA (less
common, e.g. some viruses such as HIV, influenza, SARS-CoV-2, …)
• This genetic information can change over time

Sequences that share more similar genetic material are more closely related to each other
than those from a different population or species.

Evolution is analyzed by looking backward in time. A set of genetic samples is compared, and by examining the
number and patterns of differences, scientists can infer evolutionary relationships and historical changes in
populations.
Replication inevitably results in genetic change → VERTICAL evolution
• point mutations
• insertions and deletions (indels)

"The process of evolution, driven by genetic variation, is well captured through structural changes in the
genome."

Regarding vertical evolution:
"Vertical evolution refers to the transmission of genetic material from parent to offspring over generations, leading to
gradual changes in a species through mutation, recombination, and natural selection."

As we progress through the slides and move closer to the present, you can see how the process of evolution through
common descent with variation is represented in this branching tree structure. This model can then be used to
reconstruct evolutionary history.
When we start with a set of samples or sequences, it should also be visually clear that presenting a phylogeny in this way
reflects the concept of vertical evolution.
The main mechanisms by which evolution generates genetic diversity are point mutations, insertions, and deletions.

A point mutation occurs when DNA or RNA polymerase incorporates an incorrect complementary nucleotide. For
example, if a base pair is mismatched (such as A pairing with C instead of A with T), a point mutation arises. Over time, this
process can lead to significant genetic variation.

Another way genetic diversity arises is through insertions and deletions. This happens when polymerases fail to
incorporate the exact number of nucleotides. Instead of adding just one nucleotide, two, three, or more may be inserted,
resulting in an insertion in the daughter strand relative to the parent strand. Conversely, if the polymerase skips a few
nucleotides during replication, a deletion occurs in the daughter strand compared to the parent strand.

Data set compilation
• a high quality data set is a prerequisite for being able to draw meaningful conclusions
• what a good data set is primarily depends on the question at hand
• usually: own sampling combined with publicly available sequence data
• sequencing is getting cheaper and cheaper
• data sharing incentives:
1. most journals have strict data sharing policies
2. most scientists
3. all pathogen genetic sequencing data during outbreaks or other public
health emergencies must be made publicly available → WHO Code of Conduct

Examples of such databases include:
• NCBI GenBank (fasta) https://www.ncbi.nlm.nih.gov/genbank/
• NCBI SRA (fastq or bam) https://www.ncbi.nlm.nih.gov/sra
• NCBI Virus: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/
• NCBI Microbes: https://www.ncbi.nlm.nih.gov/genome/microbes/
• European Nucleotide Archive: https://www.ebi.ac.uk/ena/browser/home
• EnsemblBacteria: https://bacteria.ensembl.org/index.html
• DNA DataBank of Japan: https://www.ddbj.nig.ac.jp/index-e.html
• Other specialty databases:
• GISAID (viruses): https://www.gisaid.org/
• BacWGSTdb (bacteria): http://bacdb.cn/BacWGSTdb/
• PlasmoDB (plasmodium): https://plasmodb.org/plasmo/app/

Metadata matter
Minimum information needed:
- date of sample collection and/or onset of patient symptoms (yyyy-mm-dd): the year alone is not sufficient; precision matters
- location of sample collection and/or location of exposure: country level minimum; higher resolution when possible (ethical considerations)
- host and sample type used to generate the sequence, e.g. human serum, tissue, environmental, vector, etc.
Additional information:
- patient: symptoms, outcomes, treatment/vaccination, PCR test results
- passage history
- sample collection methods
- sequencing & data processing methods (access to raw data)
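
To make the point about date precision concrete, here is a minimal sketch of how the precision of a collection date could be checked before a sequence is added to a data set. The function name `date_precision` and the example dates are hypothetical and only serve as an illustration; they are not part of the course material.

```python
import re
from datetime import date


def date_precision(value: str) -> str:
    """Return 'day', 'month' or 'year' for a yyyy[-mm[-dd]] collection date."""
    match = re.fullmatch(r"(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?", value.strip())
    if match is None:
        raise ValueError(f"not a yyyy[-mm[-dd]] date: {value!r}")
    year, month, day = match.groups()
    if day is not None:
        date(int(year), int(month), int(day))  # raises ValueError for impossible dates
        return "day"
    if month is not None:
        date(int(year), int(month), 1)  # validates the month
        return "month"
    return "year"


# Hypothetical metadata values, for illustration only
for collection_date in ["2017-03-15", "2017-03", "2017"]:
    print(collection_date, "->", date_precision(collection_date))
```

Records that only reach "year" precision could then be flagged or left out of analyses that depend on sampling time.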




Left: without much precision.

• What are we looking for? Homologous sequences (sequences are homologous when they share
a common ancestor)
• How? Look for excess similarity compared to expectation under random process

→ Goal: maximise similarity. The higher the similarity, the higher the alignment score.
Need to compare the genetic material: positional homology, i.e. sequences are rearranged such that homologous
nucleotides form columns
How to reach positional homology in practice?
- give a reward for a match
- give a penalty for a mismatch
$\mathrm{score}(i,j) = \lambda \log\left(\dfrac{q_{ij}}{p_i \, p_j}\right)$
- hypothesis 1 (the evolutionary model): $q_{ij}$: probability that nucleotides i and j are
aligned because they're homologous
- hypothesis 2 (the random model): $p_i \, p_j$: probability that nucleotides i and j are
aligned in a random alignment
- $\lambda$: scaling factor

This shows that, with a minimal set of assumptions, we can easily devise our
own scoring scheme that is optimal for a given setting, in this case
alignments with 88% identity.
Note: the teachers will not ask you to devise your own scoring scheme.
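
As an illustration of the log-odds score above, the sketch below works out match and mismatch scores for alignments with 88% identity. It rests on two assumptions that are not stated in the notes: all four nucleotides have the same background frequency (0.25), and the 12 possible mismatches are equally likely. The function name `log_odds_scores` is hypothetical.

```python
import math


def log_odds_scores(identity: float, scale: float = 1.0) -> tuple[float, float]:
    """Match and mismatch scores: scale * log(q_ij / (p_i * p_j))."""
    p = 0.25                           # assumed uniform background frequency
    q_match = identity / 4             # 4 identical pairs share the identity mass
    q_mismatch = (1 - identity) / 12   # 12 ordered mismatch pairs share the rest
    match = scale * math.log(q_match / (p * p))
    mismatch = scale * math.log(q_mismatch / (p * p))
    return match, mismatch


match, mismatch = log_odds_scores(0.88)
print(round(match, 2), round(mismatch, 2))  # 1.26 -1.83
```

Under these assumptions a match is rewarded (positive score) and a mismatch is penalised (negative score); the scaling factor only rescales both values, for example to get convenient integers.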

These matches or mismatches also contribute to genetic diversity,
which is further generated through insertions and deletions. In a
sequence alignment, these appear as gaps. However, unlike matches
and mismatches, which have a straightforward analytical solution,
multiple analytical methods exist for handling gaps.

Fortunately, we can also approach this biologically. When examining
the coding regions of the genome, we know that evolution more often
occurs through the addition or deletion of nucleotides in groups of
three rather than as single nucleotides. This is because the genetic
coding system is based on codons, which consist of three nucleotides. Changes in multiples of three disrupt the protein
code less than single insertions or deletions, which can cause frameshift mutations.

This principle can be incorporated into a scoring model for alignments, as shown here. For example, if you assign a fixed
penalty of -5 to each gap (-), the total score remains the same regardless of how the gaps are placed in the alignment.
However, by using a gap opening penalty and a gap extension penalty, you can assign values that better reflect biological
reality, resulting in a more accurate scoring model.
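
The difference between the two gap models can be sketched as follows. The fixed penalty of -5 comes from the example above; the gap opening (-10) and gap extension (-1) values are hypothetical and chosen only for illustration, as are the function names.

```python
def gap_lengths(aligned: str) -> list[int]:
    """Lengths of the consecutive runs of '-' in one aligned sequence."""
    runs, current = [], 0
    for char in aligned:
        if char == "-":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def linear_penalty(aligned: str, gap: int = -5) -> int:
    """Fixed penalty per gap character, wherever the gaps are."""
    return gap * sum(gap_lengths(aligned))


def affine_penalty(aligned: str, gap_open: int = -10, gap_extend: int = -1) -> int:
    """Opening a gap is expensive, extending an existing gap is cheap."""
    return sum(gap_open + gap_extend * (n - 1) for n in gap_lengths(aligned))


one_block = "ATG---CCGTA"   # one gap of three positions (one codon)
scattered = "A-TG-CC-GTA"   # three separate single-position gaps

print(linear_penalty(one_block), linear_penalty(scattered))   # -15 -15
print(affine_penalty(one_block), affine_penalty(scattered))   # -12 -30
```

With the fixed penalty both placements score the same, whereas the opening/extension scheme favours the single three-nucleotide gap, which matches the codon-based reasoning above.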

Database querying in practice: BLAST
Software designed to look up similar sequences.

Data set compilation




Alignment editing
- always visually inspect an alignment/output
- for coding sequences, the amino acid alignment can guide the nucleotide alignment (particularly useful for divergent
sequences)
- sections that cannot be unambiguously aligned should be removed (avoid non-homology)
- editing can be done manually (always somewhat subjective) or in an automated fashion (more objective)
→ This process can be quite lengthy and time-consuming, especially when dealing with a large number of sequences that
need to be arranged in a matrix. It becomes even more tedious when a new set of sequences becomes available, requiring
you to start over from scratch and wait for the multiple sequence alignment (MSA) to be completed again.

An alternative approach is to create an MSA for the newly added data and
then align this new MSA with the existing one. This is done by generating
profiles for both alignments and comparing them.

There are various methods for performing profile-based multiple sequence
alignments.

On the right side of the slide, there is an example where an alignment with
hundreds of sequences is summarized in just four lines by representing the
percentage of each nucleotide at each position. This significantly reduces
computational requirements and speeds up the process of incorporating new data into an
existing alignment.
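
A minimal sketch of such a profile is given below: each alignment column is summarised by the fraction of A, C, G and T it contains, so the whole alignment collapses into four rows. The toy alignment and the function name `nucleotide_profile` are hypothetical; gap characters are simply not counted, which is a simplification.

```python
from collections import Counter


def nucleotide_profile(alignment: list[str]) -> dict[str, list[float]]:
    """Fraction of each nucleotide per alignment column (four rows in total)."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = {base: [] for base in "ACGT"}
    for column in zip(*alignment):      # iterate over columns, not sequences
        counts = Counter(column)
        for base in "ACGT":
            profile[base].append(counts[base] / n_seqs)
    return profile


toy_alignment = ["ACGT", "ACGA", "ATGT", "ACGT"]  # hypothetical sequences
for base, fractions in nucleotide_profile(toy_alignment).items():
    print(base, fractions)
```

However many sequences the alignment contains, the profile keeps the same size (four values per column), which is why comparing profiles is cheaper than realigning every sequence.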

Automated alignment editing
Depending on the score of a column in the alignment (e.g. gap or
similarity score), the column is kept or not.
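
As a rough sketch of this idea, the function below keeps a column only when its fraction of gaps does not exceed a threshold. The threshold of 0.5, the toy alignment and the function name `drop_gappy_columns` are assumptions made for illustration; dedicated trimming tools use more refined column scores.

```python
def drop_gappy_columns(alignment: list[str], max_gap_fraction: float = 0.5) -> list[str]:
    """Keep only the columns whose gap fraction is at most max_gap_fraction."""
    n_seqs = len(alignment)
    keep = [
        index
        for index, column in enumerate(zip(*alignment))
        if column.count("-") / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[index] for index in keep) for seq in alignment]


toy_alignment = ["AC-GT", "A--GT", "ACTGT", "A--GA"]  # hypothetical sequences
print(drop_gappy_columns(toy_alignment))
# ['ACGT', 'A-GT', 'ACGT', 'A-GA']
```

The third column (75% gaps) is removed, while the second column (50% gaps) is kept because it does not exceed the threshold.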
Sequence file formats
- FASTA: Simple format for nucleotide/protein sequences with a header line
starting with ">".
- NEXUS: Complex format for phylogenetic analysis, includes sequence
data and metadata.
- PHYLIP: Format for sequence alignments used in phylogenetic analysis.
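
To show how simple the FASTA format is, here is a minimal reader written with the standard library only. It is a sketch for illustration: the record names and sequences are hypothetical, and in the exercise itself a library parser such as `Bio.SeqIO` would normally be used instead.

```python
import io


def read_fasta(handle) -> dict[str, str]:
    """Read FASTA records into {identifier: sequence}."""
    records: dict[str, list[str]] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # skip empty lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""   # identifier = first word of header
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)    # sequences may span several lines
    return {key: "".join(parts) for key, parts in records.items()}


example = io.StringIO(">seq1 first example\nACGT\nACGA\n>seq2\nATGT\n")
print(read_fasta(example))
# {'seq1': 'ACGTACGA', 'seq2': 'ATGT'}
```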

Phylogenies




• neighbor-joining is based on the minimum evolution criterion i.e. the topology that gives the least total branch length is
preferred at each step of the algorithm.
• the heuristic is not guaranteed to find the tree topology with the least total branch length
• even though it is sub-optimal in this sense, it has been extensively tested and usually finds a tree that is quite close to the
optimal tree
• very fast
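
A single step of the algorithm can be sketched as follows, assuming the standard neighbor-joining criterion Q(i,j) = (n - 2) d(i,j) - Σ_k d(i,k) - Σ_k d(j,k): the pair with the lowest Q is joined first. The distance matrix is a small hypothetical example and the function name `nj_pair_to_join` is made up for this sketch; a full implementation would repeat this step after replacing the joined pair by a new node.

```python
from itertools import combinations


def nj_pair_to_join(names: list[str], dist: list[list[float]]):
    """Return the pair of taxa with the lowest Q value and that Q value."""
    n = len(names)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i, j in combinations(range(n), 2):
        q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
        if q < best_q:
            best_pair, best_q = (names[i], names[j]), q
    return best_pair, best_q


# Hypothetical pairwise distances between five taxa
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pair_to_join(names, dist))  # (('a', 'b'), -50)
```

Note that d and e have the smallest raw distance (3), yet a and b are joined first: the criterion corrects each distance for how far both taxa are from all the others.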
LESSON 1 2025 – NJ: quickly building a phylogenetic tree
Database querying with BLAST, multiple sequence alignment and phylogenetic reconstruction
After this exercise you will be able to
• query the GenBank database via the web interface and through the command line
• look up sequences similar to a query in a database with BLAST via the command line
• create and edit a multiple sequence alignment
• reconstruct a phylogeny with the NJ algorithm
For this exercise we revisit an analysis of a Yellow Fever Virus (YFV) genome generated from a Dutch patient with a travel
history from Suriname. When this patient got infected in 2017, it represented the first YFV case notified from Suriname in
decades. Around that time, there also was an outbreak of yellow fever in Brazil. The Dutch patient's YFV genome was used
to identify a possible origin for the virus and to evaluate whether the reported YFV was introduced from the ongoing 2016-
2017 Brazilian outbreak. The original report can be found in this post on virological.org.
The workflow is as follows:
1. look up the Dutch patient's YFV sequence
2. fetch a set of highly similar sequences
3. create and edit the multiple sequence alignment
4. reconstruct a NJ tree

Before diving into the data analysis, set up the environment. We begin by loading the Python modules to be used.
from Bio.Blast import NCBIWWW, NCBIXML
from Bio import Entrez, SeqIO, Phylo, AlignIO
import pandas as pd
import time, os, alv, re
from IPython.display import Image
Load the interface to run R embedded in a Python process:
%load_ext rpy2.ipython
[R] Load R packages to be used:
Note: a cell that starts with %%R is run by R, so you are no longer working in Python in that cell.
%%R
library("ape")
Define global variables for the input names that will only have to be changed once.
baseName = "Dutch_patient_YFV"
folderName = "exercise_YFV"
Push these variables to R using the line magic %Rpush
%Rpush baseName folderName
Let's check whether this actually works:
%%R
print(paste0("baseName: ", baseName, " -- folderName: ", folderName))
[1] "baseName: Dutch_patient_YFV -- folderName: exercise_YFV"
Variables can be loaded into bash using the option -s

NOTE
The values are passed as if they were the input arguments of a script. This means that the variables are no longer named but
indexed by position. Hence, to refer to such variables, use $1 instead of $baseName and $2 instead of $folderName

[bash] Let's also check this for bash:
%%bash -s "$baseName" "$folderName"
echo baseName: "$1" -- folderName: "$2"
baseName: Dutch_patient_YFV -- folderName: exercise_YFV
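
Outside a notebook, the same positional behaviour can be reproduced from plain Python. The sketch below is only an approximation of what the %%bash cell does: it starts `bash -s`, which reads the script from standard input and takes the remaining arguments as $1 and $2. It assumes that `bash` is available on the system.

```python
import subprocess

base_name = "Dutch_patient_YFV"
folder_name = "exercise_YFV"

# The script only sees positional parameters, not the Python variable names
script = 'echo "baseName: $1 -- folderName: $2"\n'

result = subprocess.run(
    ["bash", "-s", base_name, folder_name],  # -s: read commands from stdin
    input=script,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
# baseName: Dutch_patient_YFV -- folderName: exercise_YFV
```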

Create a folder where analysis files can be saved:
%%bash -s "$folderName"
mkdir "$1" # fails if the folder already exists; mkdir -p "$1" skips the error and leaves existing files untouched
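
The same step can also be done from Python. The sketch below uses a temporary directory so that it does not touch your working folder; `exist_ok=True` tolerates a folder that already exists and keeps its contents. The file name `notes.txt` is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "exercise_YFV")
    os.makedirs(folder, exist_ok=True)                   # creates the folder
    open(os.path.join(folder, "notes.txt"), "w").close() # put a file in it
    os.makedirs(folder, exist_ok=True)                   # no error, nothing is overwritten
    contents = os.listdir(folder)
    print(contents)  # ['notes.txt']
```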

Uploaded by Lorejansens123 (Odisee Hogeschool)