Contents
Lesson 1: Intro to NGS, genome variation, genomic medicine
  Applications of genomic variations
  (NGS) methods to determine genetic variation
Lesson 2: Evaluating and processing raw sequence data
  NGS analysis pipeline
  General useful commands jupyter
  Case studies
Lesson 3: Variant calling and annotation
  Variant calling
  Jupyter case study (1) - variant calling
  Variant annotation
  Jupyter case study (1) - annotation
Lesson 4: Non-coding genetic variants
  Enformer (python language)
Lesson 5/6: Variant interpretation & personal genomics
Lesson 7: Copy number variation
Lesson 8+9: Complex structural variation
Lesson 10: Single Cell CNA calling
  R (similar to python language)
  Questions in the Jupyter notebook
Lesson 11: Guest speakers
Lesson 1: Intro to NGS, genome variation, genomic medicine
Genomic variation is related to disease. There are different types of variation:
- SNPs (single-nucleotide polymorphisms): “DNA spelling mistakes”, a change of one nucleotide
- INDELs: “extra or missing DNA”, a few nucleotides inserted or deleted
- SVs (structural variants): large blocks of extra, missing or rearranged DNA
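As a rough illustration of the three categories, the sketch below labels a variant from its reference and alternate alleles. The function name and the 50-nucleotide cut-off for calling something an SV are assumptions made for this example (a commonly used convention, not something stated in the lecture), and rearrangements such as inversions or translocations cannot be recognised from allele lengths alone.

```python
def classify_variant(ref: str, alt: str, sv_min_length: int = 50) -> str:
    """Label a variant as SNP, INDEL or SV from its REF and ALT alleles.

    sv_min_length is an assumed threshold: size changes of at least this
    many nucleotides are called SV, smaller ones INDEL.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"  # one nucleotide replaced by another
    size_change = abs(len(ref) - len(alt))
    if size_change >= sv_min_length:
        return "SV"  # large block of extra or missing DNA
    if size_change > 0:
        return "INDEL"  # a few nucleotides inserted or deleted
    return "other"  # same-length multi-nucleotide substitution


print(classify_variant("A", "G"))             # one nucleotide change
print(classify_variant("AT", "A"))            # one nucleotide deleted
print(classify_variant("A", "A" + "C" * 60))  # 60 nucleotides inserted
```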
Applications of genomic variations
Health conditions:
1. Non-invasive prenatal test (NIPT)
2. Mendelian disorders
   a. Trio-based sequencing: unaffected parents and an affected offspring
   b. Examples: SMA, BRCA1
3. Complex diseases: polygenic risk
   a. No single gene is responsible: the risk is polygenic
   b. Many traits are polygenic → Genome-Wide Association Study (GWAS): associate the absence/presence of SNPs in cases (with disease) and controls (without disease)
      i. Every SNP is tested for association with the disease and gets a p-value
   c. Gene prioritisation is also possible: if a SNP is present, is the gene expression higher? Try to attribute a SNP to the closest gene.
   d. Another use is to quantify genetic risk as a diagnostic tool
   e. Example: Alzheimer's disease
      i. In the meta-analysis of Alzheimer’s, everything above the red line is significant
4. Cancer genomics:
   a. Somatic mutations → very different genetic profiles
   b. Far more so than in the other areas discussed above, driver genes and mutations in cancer provide clear molecular targets for therapeutic agents → broad application
   c. Non-small cell lung cancers with activating somatic mutations in the EGFR kinase → EGFR kinase inhibitor gefitinib
   d. TCGA and PCAWG: broad surveys
      i. About half of the common tumours contain one or more clinically relevant mutations, predicting sensitivity or resistance to specific agents or suggesting clinical trial eligibility
   e. Tumours shed DNA into the blood → circulating tumour DNA (ctDNA) → liquid biopsies
   f. Evolution graphs of mutations show where the problems are → personalised medicine
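The case/control association test described under complex diseases can be sketched as follows. This is a minimal illustration, not the method used in the lecture: it assumes a plain Pearson chi-square test on a 2x2 table (SNP present/absent versus case/control), and the SNP names and counts are hypothetical. Real GWAS software typically uses regression models with covariates.

```python
import math


def chi_square_2x2(case_with, case_without, control_with, control_without):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    table = [[case_with, case_without], [control_with, control_without]]
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: (cases with SNP, cases without, controls with, controls without)
snp_counts = {
    "snp_A": (60, 40, 40, 60),
    "snp_B": (50, 50, 50, 50),
}

for name, counts in snp_counts.items():
    chi2, p = chi_square_2x2(*counts)
    print(f"{name}: chi2={chi2:.2f}, p={p:.4g}")
```

Because a real study tests a very large number of SNPs, the significance line is set far below 0.05; 5e-8 is the conventional genome-wide threshold.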
Traits: Genomic variation also leads to different traits such as height, eye colour, etc.
Ancestry: Genetic variants are the "bread crumbs" for tracking evolution
(NGS) methods to determine genetic variation
Restriction fragment length polymorphism (RFLP): restriction enzymes cut DNA, yielding fragments of
different sizes. Mutations may disrupt this pattern, and the changed pattern can be linked to disease.
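The idea behind RFLP can be shown with a small in-silico digest. The sketch below assumes the EcoRI recognition site GAATTC, cut after the first G; the function name and the toy sequences are made up for illustration. A single nucleotide change inside one site removes a cut and therefore changes the fragment sizes.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list:
    """Return the fragment lengths after cutting at every occurrence of site.

    cut_offset is the position inside the site where the enzyme cuts
    (1 for EcoRI: G^AATTC).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(len(sequence) - start)  # remainder after the last cut
    return fragments


# Toy sequences: the mutant differs by one nucleotide in the second site.
normal = "A" * 10 + "GAATTC" + "T" * 20 + "GAATTC" + "C" * 5
mutant = "A" * 10 + "GAATTC" + "T" * 20 + "GAGTTC" + "C" * 5

print(digest(normal))  # three fragments
print(digest(mutant))  # two fragments: the second site is lost
```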
Arrays and NGS have resulted in an explosion of genomic testing. Two key technologies:
1. High-density DNA microarrays genotype millions of specific positions in each of many
human genomes. Coupled with population-based maps of linkage disequilibrium (LD), array-based
genotyping enables the ascertainment of most of the common genetic variation in a human
genome at low cost.
2. Massively parallel DNA sequencing technologies can generate billions of short sequencing
reads within a day or less → next generation sequencing (NGS) now permits the near-comprehensive
ascertainment of both rare and common genetic variation.
Most technologies deliver the DNA sequencing information in FASTQ format. Demultiplexing the reads
generates 2 FASTQ files for each sample (forward and reverse reads) with paired-end sequencing.
[Figure: Different types of genome alterations that can be detected by NGS]
[Figure: Types of point mutations in protein-coding genes]
Mutations in regulatory regions are harder to interpret. Machine learning approaches can help to
understand the effect of such genetic variations.
Lesson 2: Evaluating and processing raw sequence data
NGS analysis pipeline
Three main formats:
1. Raw reads (FASTQ)
2. Alignment file (SAM/BAM)
3. Variant calls (VCF)
Raw reads
Start with sequencing (FASTQ), e.g. Illumina: sequencing by synthesis. Each read takes four lines:
1. First line is the identifier, which starts with @
2. Second line is the sequence
3. Third line is the separator, which starts with +
4. Fourth line is the quality string: how good/certain each base of the sequence is
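The four-line layout above can be read with a few lines of Python. This is a minimal sketch that assumes every record takes exactly four lines (no wrapped sequences); the function name and the two example reads are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from an open FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or a blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality  # drop the leading @


# Two made-up reads held in memory instead of a file on disk.
example = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!5I\n"

for identifier, sequence, quality in read_fastq(io.StringIO(example)):
    print(identifier, sequence, quality)
```

For a real file, the same function can be given the handle returned by `open("sample.fastq")`, where the file name is a placeholder.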
Phred scores are quality scores that express how certain it is that a base was recorded correctly (range 0-40).
Everything >28 is good.
The scores are encoded: every symbol/letter represents a number:
https://en.wikipedia.org/wiki/Phred_quality_score
The Illumina encoding is the one mostly used nowadays.
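To make the encoding concrete, the sketch below converts a quality string back to numbers and to error probabilities. It assumes the Phred+33 ASCII offset used by current Illumina output (older Illumina pipelines used an offset of 64) and the standard relation P = 10^(-Q/10); the function names are chosen for this example.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to a list of Phred scores.

    offset=33 is an assumption (Phred+33 encoding).
    """
    return [ord(symbol) - offset for symbol in quality]


def error_probability(score: int) -> float:
    """Probability that a base with this Phred score was called incorrectly."""
    return 10 ** (-score / 10)


quality = "!5I"
scores = phred_scores(quality)
print(scores)  # [0, 20, 40]
for symbol, score in zip(quality, scores):
    verdict = "good" if score > 28 else "check"  # >28 is the threshold from the notes
    print(symbol, score, error_probability(score), verdict)
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base, and 40 to 1 in 10,000.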