Bioinformatics= study of informatic processes in biotic systems
Bioinformatic data analysis= using computational methods to analyse biological data
> no need to grow/culture the organism to study it; sequence directly from the sample
-OMICS = looking at entirety of … in organism/tissue/cell by sequencing [reduces bias]
▪ Genomics: DNA of one organism
▪ Transcriptomics: mRNA in organism/tissue/cell
▪ Proteomics: proteins in organism/tissue/cell
META- = sequencing whole … of all organisms in a sample (Whole-genome shotgun sequencing)
o Metagenomics: considers only DNA material
o Metatranscriptomics: mRNA
o Metaproteomics: proteins
Microbiome: all microbes (most studied and well defined bacteria are human-associated)
> research interest is mostly in human-related diseases, food, and humans themselves, which biases both our understanding of biology and the databases
1) QUESTION FIRST: choosing the dataset based on a given biological question [top-down]
2) DATA FIRST: choosing a biological hypothesis to test based on a given dataset [bottom-up]
> looking for sequences similar to ‘query’ sequence in database <
▪ K-mer searches: dividing sequences into shorter subsequences (k-mers consisting of ‘k’ nucleotides)
- Needed because mutations make an exact match of the full sequence unlikely
- Limited by the need for an exact match of each k-mer
- Splitting the query sequence into k-mers to rapidly identify all database sequences containing them
▪ Natural sequence divergence: aligning metagenomic sequencing reads to a reference genome [pairwise]
- the more exact hits, the more closely related (also identifies more distantly related strains)
▪ Sequence alignment: aligning two sequences so they match as well as possible
- introduces ‘gaps’, which are thought to represent insertions/deletions that arose through evolution
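A minimal sketch of the exact k-mer search idea, assuming a tiny in-memory database. The names `kmers`, `build_index` and `shared_kmer_counts` and the toy sequences are hypothetical, chosen only for illustration. It shows why one mutation does not destroy the hit: only the k-mers overlapping the mutation stop matching.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(database, k):
    """Map every k-mer to the set of database sequence names containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def shared_kmer_counts(query, index, k):
    """Count, per database sequence, how many distinct query k-mers match exactly."""
    counts = defaultdict(int)
    for kmer in set(kmers(query, k)):
        for name in index.get(kmer, ()):
            counts[name] += 1
    return dict(counts)


# Hypothetical toy database; the sequences are made up for illustration.
database = {
    "seqA": "ATGGCGTACGTTAGC",
    "seqB": "ATGGCGTTCGTTAGC",  # one substitution relative to seqA
    "seqC": "CCCCCCCCCCCCCCC",  # unrelated
}
index = build_index(database, k=4)
hits = shared_kmer_counts("ATGGCGTACGTTAGC", index, k=4)
```

Here `hits` ranks `seqA` above `seqB`, and `seqC` does not appear at all: the more exact k-mer hits, the more closely related.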
BLAST= ‘Basic Local Alignment Search Tool’
> combines exact k-mers (quickly finding potential hits) and pairwise alignment (only for potential hits)
• Query: sequence we search the database with
• Hit/subject: similar sequence found in the database
• Heuristic: practical method not guaranteed to be optimal, but sufficient for present goals
I) Identifies all words (word length ‘W’) in the query → W=3 for protein & W=11 for DNA [based on substitutions]
II) Quickly finds similar words in the database → similarity defined by substitution matrix & neighbourhood score threshold (T)
> a word counts as similar if it is an exact match or scores above the ‘neighbourhood score threshold’ (lower T = more words included)
> higher T = fewer words included
III) Extends in both directions to find HSPs between query and hit → is the match bigger than the given word (W)?
> HSP (high-scoring segment pair) = region that can be aligned with a score above a certain threshold
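Steps I) and III) can be sketched roughly as follows. This is a simplified illustration, not BLAST's actual implementation: it uses exact word matches only (no neighbourhood words), ungapped extension, and the scoring values `match=1`, `mismatch=-2` and the drop-off `x_drop=3` are assumptions chosen for the example.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], ()):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, w, match=1, mismatch=-2, x_drop=3):
    """Extend a seed without gaps in both directions.

    Returns (score, query_start, query_end) with query_end exclusive.
    Extension stops once the running score falls more than x_drop
    below the best score seen so far.
    """
    def walk(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + w, j + w, +1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = w * match + left_score + right_score
    return score, i - left_len, i + w + right_len
```

A seed that grows into a segment scoring above some chosen cutoff would then be reported as an HSP; the cutoff itself is left to the caller here.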
Local alignment: finds the optimal sub-alignment within two sequences (partially homologous, small parts)
Global alignment: aligns two sequences from end to end (known to be completely homologous, e.g. due to gene duplication)
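The difference between the two can be sketched with one dynamic-programming function: with `local=False` it behaves like Needleman-Wunsch (global), with `local=True` like Smith-Waterman (local). The function name and the scores `match=1`, `mismatch=-1`, `gap=-2` are assumptions for illustration; real tools use substitution matrices and separate gap-open/extend penalties.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b by dynamic programming.

    local=False: end-to-end alignment, leading/trailing gaps are penalised.
    local=True:  best sub-alignment, scores are never allowed below 0.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: starting with gaps costs gap penalties.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]
```

For two sequences that only share a short region (for example `"AAAAACGT"` and `"ACGTCCCC"`), the local score stays high while the global score is dragged down by the unrelated flanks.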
Nucleotide-nucleotide searches
> blastn [W=11]: finds homologous genes in different species
> megablast [W=28]: finds longer alignments between similar nucleotide sequences (same species)
> discontiguous megablast: uses discontinuous words; W=11 gives AT-GT-AC-CG-CG-T… (focus on the first two positions of each codon)
- the third nucleotide of codons is less conserved due to the degeneracy of the genetic code
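A small sketch of such a discontinuous word. The template below simply follows the AT-GT-AC-CG-CG-T pattern from the notes (11 required positions over a 16-base window); the templates actually used by discontiguous megablast may differ, and `spaced_word` and `spaced_match` are hypothetical helper names.

```python
TEMPLATE = "1101101101101101"  # '1' = must match, '0' = ignored (third codon position)


def spaced_word(seq, start, template=TEMPLATE):
    """Return the bases at the '1' positions, with '-' at the ignored positions."""
    window = seq[start:start + len(template)]
    if len(window) < len(template):
        raise ValueError("window runs past the end of the sequence")
    return "".join(base if t == "1" else "-" for base, t in zip(window, template))


def spaced_match(a, b, template=TEMPLATE):
    """True if a and b agree at every required position of the template."""
    return spaced_word(a, 0, template) == spaced_word(b, 0, template)


# Two made-up sequences that differ only at every third position.
seq1 = "ATGGTTACACGACGAT"
seq2 = "ATAGTCACGCGTCGCT"
same_word = spaced_match(seq1, seq2)
```

A contiguous word of length 11 would not match between `seq1` and `seq2`, but the discontinuous word does, which is why this mode tolerates third-position (wobble) differences.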
Protein-protein searches (protein database & protein query sequences)
> blastp [W=3 amino acids]: homologous proteins in different species
> blastx & tblastx: first translate the query from nucleotide into protein before identifying high-scoring words
> tblastn & tblastx: use a translated database of nucleotide sequences stored as proteins
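What "translating the query" means can be sketched as a six-frame translation: three reading frames on the forward strand and three on the reverse complement. This sketch assumes the standard genetic code (built here from the conventional TCAG ordering) and uppercase A/C/G/T input; the function names are illustrative, and codons with other characters are shown as `X`.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Each of the six protein strings would then be searched like a blastp query; for tblastn and tblastx the same idea is applied to the database sequences.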
BLAST TERMS
Query cover= the percentage of your input sequence (the "query") that is aligned to a sequence in the database
Identity= the percentage of matching bases or amino acids between two aligned sequences
Bit score= a measure of the quality of an alignment, reflecting the statistical significance of the similarity between two sequences
E-value (expect)= the number of hits of the same quality one can expect to see by chance when searching a
random database of this particular size
--> E-value= X, then we expect X hits of similar or better quality in the NCBI database simply by chance
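These terms can be made concrete with a few lines of arithmetic. The sketch below is a simplification: query cover is computed for a single alignment only, and the E-value uses the standard Karlin-Altschul relation E = m · n · 2^(-S') (m = query length, n = database length, S' = bit score) without the length corrections that BLAST applies. All function names are hypothetical.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns with identical residues ('-' is a gap)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(q == s and q != "-" for q, s in zip(aligned_query, aligned_subject))
    return 100.0 * matches / len(aligned_query)


def query_cover(query_start, query_end, query_length):
    """Percentage of the query inside the alignment (1-based, inclusive coordinates)."""
    return 100.0 * (query_end - query_start + 1) / query_length


def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)
```

Two consequences follow directly from the formula: every extra bit halves the E-value, and the same alignment gets a larger E-value in a larger database.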
BB QUIZ 1
BLAST flavor | Function/compares | Application
blastn | nucleotide query > nucleotide database | broadly similar hits
megablast | nucleotide query > nucleotide database (longer alignments) | closely related hits
blastp | protein query > protein database | comparing proteins
blastx | (translated) nucleotide query > protein database | piece of DNA/RNA that possibly codes for a protein
tblastn | protein query > translated nucleotide database | from a known protein sequence to possible coding regions in a gene (exons)
tblastx | (translated) nucleotide query > translated nucleotide database | diverged genes or protein-coding regions between two sets of DNA
> local alignment [standard in BLAST] = partial homology vs. global alignment = full homology [known]
> E-value: a random database of the same size is expected to contain … hits with total score …
= [expected value] indicates how many hits with a comparable or better score you would expect by chance in a random database of the same size.
BLAST
1. Mask low-complexity regions: prevents uninformative regions from causing false hits
2. Make a list of high-scoring words/k-mers: filter for words with a high score
3. Make a list of neighborhood words/k-mers: split up the query
4. Index search with high-scoring words/k-mers: look them up in the database
5. Extend the alignment: extension starting from the hit
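Step 1 can be illustrated with a simple entropy filter. Note that this is a swapped-in technique for illustration: BLAST itself uses dedicated filters (DUST for nucleotides, SEG for proteins), not this function. The window size of 12 and the entropy cutoff of 1.0 bit are assumptions, as are the function names.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per symbol) of the characters in window."""
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in Counter(window).values())


def mask_low_complexity(seq, window=12, min_entropy=1.0):
    """Soft-mask (lowercase) every base that lies in a low-entropy window."""
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for pos in range(start, start + window):
                masked[pos] = True
    return "".join(b.lower() if m else b.upper() for b, m in zip(seq, masked))


# Made-up example: a poly-A stretch flanked by more complex sequence.
example = mask_low_complexity("ACGTACGTACGT" + "A" * 12 + "ACGTACGTACGT")
```

Masked (lowercase) positions would then be skipped when building the word list in step 2, so that a poly-A stretch cannot seed thousands of meaningless hits.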
Thanks to the omics revolution, the databases now contain more genomes from different organisms, and we no longer focus only on a handful of frequently studied organisms.