Bioinformatics
Contents
Lesson 1: intro + databases
  Gene database
  Protein database
Lesson 2: databases
  Ontologies
  Gene expression
  Phenotypes/Diseases
  Model organism databases
Lesson 3: genome browsers + SQL
  Genome browsers
  Homology
  Database architectures
Lesson 4: Linux + Jupyter
  Navigating the file system
  Additional Jupyter Notebook notes
Lesson 5: EMBOSS + BedTools exercises
Lesson 6: Gene prediction
Lessons 7+8: Python
Lesson 9: Alignment, pattern matching, gene set analysis
Practical session 1
  Information retrieval
  CpG islands
  Unknown sequence study
Practical session 2
  Python CpG island
  miRNA
Lesson 1: intro + databases
Gene database
- Entrez Gene
o Part of NCBI: https://www.ncbi.nlm.nih.gov/gene/
o Each line in the gene view is a transcript isoform (isoforms arise from
alternative promoters and alternative splicing); look at the exons, introns,
non-coding exons (light green: 5’UTR, 3’UTR) and coding exons (dark green)
o Each transcript has a unique NM_ identifier = RefSeq identifier
o Each NM transcript corresponds to a unique NP_ protein entry
o More details about each NM/NP and links to the sequence in Entrez
Nucleotide are at the bottom of the Gene page
▪ Entrez Nucleotide contains all nucleotide sequences
▪ Example: search the Nucleotide database for NM_000564
▪ The number after the dot “.” is the version number
- RefSeq
o https://www.ncbi.nlm.nih.gov/refseq/
o Many sequences were/are represented more than once in GenBank
o RefSeq = curated “secondary” database that aims to provide a
comprehensive, integrated, nonredundant set of sequences
o Goal is to provide a reference sequence for each molecule in the central dogma
(DNA, mRNA, and protein)
o Each RefSeq represents a single, naturally occurring molecule from one
organism
o Nucleotide and protein sequences in RefSeq are explicitly linked to one
another
o Distinct accession number: 2+6 format (2 letters, underscore, six-digit number)
▪ NT_123456 (genomic contigs), NM_123456 (mRNAs), NP_123456 (proteins)
▪ XM_123456 (model mRNAs), XP_123456 (model proteins): computational
predictions
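The accession conventions above (two-letter prefix, underscore, number, optional version after the dot) can be checked with a few lines of Python. This is only a minimal sketch: the function name parse_refseq and the prefix table are illustrative, the version ".1" in the example is made up, and the pattern accepts any number of digits rather than exactly six.

```python
import re

# Prefixes listed in the notes, mapped to the molecule type they denote.
REFSEQ_PREFIXES = {
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "XM": "model mRNA (predicted)",
    "XP": "model protein (predicted)",
}

# Two letters, underscore, digits, and an optional ".version" suffix.
# Assumption: \d+ is used instead of exactly six digits, to be lenient.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def parse_refseq(accession):
    """Split a RefSeq-style accession into prefix, number, version and type."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style accession: {accession!r}")
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,
        # The version is None when the accession has no ".N" suffix.
        "version": int(version) if version is not None else None,
        "molecule": REFSEQ_PREFIXES.get(prefix, "unknown prefix"),
    }


# Hypothetical version number, used only to show the accession.version split.
print(parse_refseq("NM_000564.1"))
```

Prefixes that are not in the table are reported as "unknown prefix" rather than rejected, since RefSeq uses more prefixes than the five listed in these notes.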
To visualize the data, download the record in GenBank format (.gb) as a text file and open
it in a text editor such as Visual Studio Code, or in Jupyter Notebook.
- How to download:
o Click on “Send to” (top right of the page)
o Select “Complete Record” and “File”
o Choose GenBank format, or FASTA (sequence only, without the GenBank header
and feature table)
- In the FEATURES section of the record
o The sequence has a coding sequence (CDS) made up of five exons
▪ The first exon begins at base 201 and ends at base 224
▪ It is then joined to the next exon, from base 1550 to base 1920, and so forth.
o Each comma in the CDS location line represents a splicing event, and each “..”
stands for the run of bases between the two coordinates.
o The gene product is eukaryotic initiation factor 4E-II, and the gene name is
eIF4E
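The CDS location described above can be read programmatically. The sketch below assumes the location is written as join(start..end,start..end,...), as in a GenBank feature table, and uses only the first two exons because the notes give coordinates for just those two. The function names are illustrative, and complement(...) or partial locations (< and >) are not handled.

```python
import re


def parse_join(location):
    """Return a list of (start, end) pairs from a join(...) location string."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    segments = []
    # Each comma separates two exons, i.e. marks one splicing event.
    for part in inner.split(","):
        match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append((int(match.group(1)), int(match.group(2))))
    return segments


def spliced_length(segments):
    """Total number of bases covered; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in segments)


# Only the first two of the five exons are given in the notes.
exons = parse_join("join(201..224,1550..1920)")
print(exons)                  # [(201, 224), (1550, 1920)]
print(spliced_length(exons))  # 395
```

The first exon covers 24 bases (201 to 224) and the second 371 bases (1550 to 1920), which gives the total of 395.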
EMBL/EBI
o https://www.ebi.ac.uk/
o European database
o DBFETCH provides an easy way to retrieve entries from various databases at
the EMBL-EBI
o Format: https://www.ebi.ac.uk/Tools/dbfetch/db=refseqn;id=NM_000231;format=fasta&style=raw
Protein database
- UniProt: https://www.uniprot.org/
o Provides annotations in General Feature Format (GFF), a plain-text file
▪ Click Download
▪ Choose the GFF format
- Protein sequences in databases can be derived from translation of nucleotide
sequences (secondary databases)
o e.g., RefSeq NM_ to RefSeq NP_
o e.g., TrEMBL
o Go to the protein database, following one of the NP_ isoforms
- There are also curated databases: experts enhance the original data by adding new
information
o e.g., Swiss-Prot (in the UniProt Knowledgebase)
▪ Information from literature
▪ Curator-evaluated computational analysis/predictions
- 3D structures
o https://www.ncbi.nlm.nih.gov/structure/ or UniProt → Structure
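The GFF file downloaded from UniProt (see above) is a tab-separated text file, so it can be read without special libraries. The sketch below assumes the standard nine GFF columns (seqid, source, type, start, end, score, strand, phase, attributes). The function name parse_gff and the two example lines are made up for illustration and are not real UniProt records.

```python
GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff(text):
    """Parse GFF text into a list of dicts, one per feature line."""
    records = []
    for line in text.splitlines():
        # Skip blank lines and comment/directive lines such as "##gff-version 3".
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(GFF_COLUMNS):
            continue  # not a complete feature line
        # Assumption: any extra trailing columns are ignored.
        record = dict(zip(GFF_COLUMNS, fields[:len(GFF_COLUMNS)]))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        records.append(record)
    return records


# Hypothetical example content, not taken from a real entry.
example = "\n".join([
    "##gff-version 3",
    "\t".join(["EXAMPLE1", "UniProtKB", "Chain", "1", "120", ".", ".", ".", "Note=example chain"]),
    "\t".join(["EXAMPLE1", "UniProtKB", "Domain", "10", "60", ".", ".", ".", "Note=example domain"]),
])

for feature in parse_gff(example):
    length = feature["end"] - feature["start"] + 1
    print(feature["type"], feature["start"], feature["end"], length)
```

Coordinates in GFF are 1-based and inclusive, which is why the feature length is computed as end - start + 1.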
Lesson 2: databases
Ontologies
- Gene ontology (GO)
o https://geneontology.org/ or https://www.ebi.ac.uk/QuickGO/ (human gene
symbols are usually capitalized)
▪ Downloading data from QuickGO
• Click on Export
• Choose the format: gene association file (then add .txt to the file name)
• Adjust the number of annotations to export
o Specific purpose: “Annotation of genes and proteins in genomic and protein
databases”
o Facilitate complex queries
o Applicable to all species
o Databases involved:
▪ FlyBase (Drosophila)
▪ MGI (Mouse)
▪ SGD (S. cerevisiae)
▪ TAIR (Arabidopsis)
▪ TIGR (microbes including prokaryotes)
▪ SWISS-PROT (several thousand species inc. human)
▪ PSU (P. falciparum)
▪ ZFIN (zebrafish)
▪ PAMGO (plant pathogens)
o GO structure