Chapter 1. Basic tools of bioinformatics
1. Genome browsers
2. Visualizing sequence data
3. Searches
4. Databases
5. Expression
6. Enrichment
7. Proteins
8. Summary
Chapter 2. TBL CRISPR
1. CRISPR
2. Protein interactions
3. Alphafold
Chapter 3. Systems biology
1. Systems biology approach
2. ODE modelling
Chapter 4. DNA sequencing
1. Sequencing methods
2. Quality control
3. Mapping
4. Variant calling
Chapter 5. Phylogeny
Chapter 6. RNA sequencing
1. RNA-seq
2. RNA-seq data analysis
3. scRNA-seq
Chapter 7. NGS applications
Chapter 1. Basic tools of bioinformatics
1. Genome browsers
Content covered:
- Genome assembly (Reads, Contigs, Scaffolds. N50, Repeats)
- GRCh38
- Automated annotation
- Transcript support level (TSL)
- Representative/main transcript (MANE, ENSEMBL canonical, APPRIS)
- Manual annotation (HAVANA)
- Using ENSEMBL
  - Finding genes
  - Finding transcripts
  - Finding exons and introns
  - Finding splice variants
  - Finding the encoded protein
  - Interpreting the genome browser
Assembling a genome
Reads are assembled into contigs.
Contigs are assembled into scaffolds.
Gaps between contigs are filled, and the resulting long sequences are clustered and assembled
into chromosomes.
Assembly quality
Coverage (reads)
Number of contigs
N50: the length of the shortest contig in the set of longest contigs that together cover 50% of the assembly
Gaps
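The N50 metric listed above can be computed directly from a list of contig lengths. The following is a minimal sketch, not a reference implementation: the function name n50 and the contig lengths are made up for illustration, and the 50% is taken relative to the summed contig length rather than to an independent estimate of genome size.

```python
def n50(contig_lengths):
    """Return the N50 for a list of contig lengths (hypothetical helper)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # The first contig that brings the running sum to at least half
        # of the total length defines the N50.
        if 2 * running >= total:
            return length


# Made-up contig lengths, for illustration only (total = 290).
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 80 + 70 = 150 >= 145, so this prints 70
```

A higher N50 means that half of the assembly sits in longer contigs, which is why it is read together with the number of contigs and the gaps.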
Consensus on which sequences to use for the human reference genome
GRCh: Genome Reference Consortium Human
There are three levels of updates:
- Build: the actual assembly of the genome
- Patch: information is added without changing coordinates
- Release: updates
Current assembly (Ensembl): GRCh38.p14 (Build 38, Patch 14), Release 113
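The build and patch levels can be read straight from an assembly name. The sketch below assumes names of the form GRCh<build>.p<patch>; the function names parse_assembly_name and same_coordinates are hypothetical helpers, not part of any Ensembl tool.

```python
import re

# Assumed name format: "GRCh" + build number + optional ".p" + patch number.
ASSEMBLY_NAME = re.compile(r"^GRCh(?P<build>\d+)(?:\.p(?P<patch>\d+))?$")


def parse_assembly_name(name):
    """Split an assembly name into (build, patch); patch is 0 if absent."""
    match = ASSEMBLY_NAME.match(name)
    if match is None:
        raise ValueError(f"not a GRCh assembly name: {name!r}")
    build = int(match.group("build"))
    patch_text = match.group("patch")
    patch = int(patch_text) if patch_text is not None else 0
    return build, patch


def same_coordinates(name_a, name_b):
    """Patches do not change coordinates, so only the build has to match."""
    return parse_assembly_name(name_a)[0] == parse_assembly_name(name_b)[0]


print(parse_assembly_name("GRCh38.p14"))         # (38, 14)
print(same_coordinates("GRCh38.p13", "GRCh38"))  # True
print(same_coordinates("GRCh37.p13", "GRCh38"))  # False
```

Comparing only the build number reflects the rule that positions reported on different patches of one build are directly comparable, while positions from different builds are not.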
Genome data has to be combined with other databases
These include RefSeq, GenBank, CCDS and UniProt, as well as genomes of other individuals (healthy or not) and of other species.
Making sense of a genome
There are:
- Regulatory sequences
- Exons
- Introns
- mRNA
- Non-coding RNAs
- Open reading frames
- Splice variants …
Human genome assembly dates
GRCh38: released in Dec 2013, equivalent UCSC version hg38.
GRCh37: released in Feb 2009, equivalent UCSC version hg19.
NCBI Build 36.1: released in Mar 2006, equivalent UCSC version hg18.
Automated genome annotation
Genome annotation: the process of attaching biological information to sequences.
Answers these questions:
- Is there a cDNA sequence or RefSeq entry for the predicted transcript?
- Is it a stable mRNA?
- Does it contain an ORF?
- Is the protein expressed?
- Predicted domains? Activity and/or function?
- Is it conserved?
- Are mutations linked to phenotypes/diseases?
HAVANA manual annotation
Human and Vertebrate Analysis and Annotation (HAVANA): a manually curated annotation of
genomes.
- Available only for human, mouse, rat and zebrafish.
- Does not cover all transcripts.
Stable ID in Ensembl
For humans:
- Regulatory region: ENSR…
- Genes: ENSG…
- Exons: ENSE…
- Transcript: ENST…
- Protein: ENSP…
For other species, an extra three-letter species code is inserted after ENS, e.g. MUS for mouse (Mus musculus): ENSMUSR…, ENSMUSG…
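The prefix scheme above can be decoded mechanically. This is a minimal sketch under stated assumptions: the species code is exactly three letters (as in the notes), the feature letter is followed by 11 digits and an optional ".version" suffix, and parse_stable_id is a hypothetical helper rather than an Ensembl function. The IDs below only illustrate the format.

```python
import re

# Feature letters as listed in the notes above.
FEATURE_TYPES = {
    "R": "regulatory region",
    "G": "gene",
    "E": "exon",
    "T": "transcript",
    "P": "protein",
}

# Assumed layout: ENS + optional 3-letter species code + feature letter
# + 11 digits + optional ".version".
STABLE_ID = re.compile(
    r"^ENS(?P<species>[A-Z]{3})?(?P<feature>[GTEPR])"
    r"(?P<number>\d{11})(?:\.(?P<version>\d+))?$"
)


def parse_stable_id(stable_id):
    """Return (species_code, feature_type); species_code is None for human."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"unrecognised stable ID: {stable_id!r}")
    return match.group("species"), FEATURE_TYPES[match.group("feature")]


print(parse_stable_id("ENSG00000139618"))     # (None, 'gene')
print(parse_stable_id("ENSMUSG00000017167"))  # ('MUS', 'gene')
print(parse_stable_id("ENST00000380152.8"))   # (None, 'transcript')
```

Because human IDs carry no species code, a missing code is reported as None rather than guessed.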
TSL: Transcript support level
TSL1: all splice junctions of the transcript are supported by at least one non-suspect mRNA
TSL2: the best supporting mRNA is flagged as suspect, or the support is from multiple ESTs
TSL3: the only support is from a single EST
TSL4: the best supporting EST is flagged as suspect
TSL5: no single transcript supports the model structure
EST: expressed sequence tag (a short sequence read from cDNA)
Representative transcripts
MANE Select: Matched Annotation between NCBI and Ensembl
Ensembl canonical
HAVANA curated
APPRIS
APPRIS: Annotation of principal and alternative splice isoforms
Annotates splice isoforms based on protein structure, function and cross-species conservation.
- APPRIS P1: Main functional transcript
- APPRIS P2 + APPRIS ALT
Note that APPRIS is based on the encoded protein: transcripts that encode identical proteins
receive identical APPRIS annotations.
The 1000 Genomes Project
The 1000 Genomes Project: created a catalogue of common human genetic variation, using
openly consented samples from people who declared themselves to be healthy.
- Original release: 2008.
- Final release: 2015, including 2504 samples from 26 populations.