Bioinformatics summary
Lecture 1 databases
Bioinformatics
• Lookup → databases: organized for fast retrieval of the data
• Compare → information transfer => alignment
• Predict → information transfer
Databases
Databases: growing very fast, large amounts of information stored → fast retrieval of the data
is necessary
➔ Databases’ innovation: technological, logistical and administrative
• Heterogeneous data types: different kinds of data
• Homogeneous data types: the same kind of data
Fast growth due to reduced cost and time for sequencing
Primary databases: contain experimental data and annotation information → biomolecular
sequences/structures
• Nucleic acid sequences: EMBL, GenBank, DDBJ
• Protein sequences: Swissprot, TrEMBL, UniProt
• Protein structures: PDB
• Small compounds’ structures: CSD
• Genomes: Ensembl, UCSC
Secondary databases: contain data derived from primary databases → disease mutations
• Patterns, motifs, domains: PROSITE, Pfam, PRINTS, InterPro
• Disease mutations: OMIM, MIM
• SNPs: dbSNP
• Pathways: KEGG
Databases’ format
In order to function, every database must come in a specific format so that software can recognize it
The essential components of the format depend on the content of the database; however, the data
elements that are essential for every database are:
- Unique identifier = accession code
- Name of depositor
- Literature references
- Deposition date
- The real data
Nomenclature:
- Database entry/record
- Database fields
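The essential elements and the entry/field nomenclature above can be pictured as a small data structure. This is only a minimal sketch: the class name `DatabaseEntry`, its attribute names and all example values are hypothetical and do not correspond to the schema of any real database.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DatabaseEntry:
    """One database entry (record); each attribute is a database field."""
    accession: str            # unique identifier = accession code
    depositor: str            # name of depositor
    deposition_date: date     # date on which the entry was deposited
    data: str                 # the actual data, e.g. a sequence
    references: List[str] = field(default_factory=list)  # literature references


# Hypothetical example values, for illustration only
entry = DatabaseEntry(
    accession="X00001",
    depositor="A. Researcher",
    deposition_date=date(2020, 1, 15),
    data="ATGGCCAAGT",
    references=["Hypothetical citation"],
)

# Because the accession code is unique, it can serve as the lookup key
index = {entry.accession: entry}
print(index["X00001"].deposition_date)
```

Keying the index on the accession code is what makes fast retrieval possible: one identifier maps to exactly one entry.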
Data quality
The “quality” of a database’s data, in the sense of it being true according to current understanding,
depends on several aspects:
• Deposition date → you must be able to find it, so it is an essential field for a database
• Automatic check
• Who is allowed to deposit data? → depositor mentioned, cross references
• Are there annotations to the data?
Swissprot: only submitted by experts, manually annotated and reviewed
➔ Pro: high quality data
➔ Con: less data available than in less strict databases
EMBL, PDB, UniProt: anyone who wants to submit data can do so
➔ Pro: lots of data stored in the database → high chance of finding what you’re looking for
➔ Con: not always such high quality data
Swissprot
Only submission by experts
Info reviewed → updated, checking if info is correct
Manual fact check during deposition
Manually added annotations
Obligatory deposit in Swissprot before publication
Swissprot is part of Uniprot
The other part of UniProt is TrEMBL → relatively lower quality
Swissprot is a keyword-organized flat file
How you can view the data depends on the database
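To make "keyword-organized flat file" concrete: every line of an entry starts with a short keyword (line code) that says which field the line belongs to, and a line containing `//` closes the entry. The sketch below groups lines by keyword. The record is made up and heavily simplified, and `parse_flat_record` is a hypothetical helper, not part of any Swissprot tool.

```python
from collections import defaultdict

# Made-up, simplified record in the style of a keyword-organized flat file
RECORD = """\
ID   EXAMPLE_HUMAN
AC   X00001;
DE   Hypothetical example protein.
DE   Second description line.
OS   Homo sapiens.
//
"""


def parse_flat_record(text):
    """Group the content of each line under its leading keyword."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-entry marker
            break
        if not line.strip():           # skip blank lines
            continue
        keyword, _, value = line.partition(" ")
        fields[keyword].append(value.strip())
    return dict(fields)


fields = parse_flat_record(RECORD)
print(fields["AC"])
print(fields["DE"])
```

A keyword that occurs on several lines (here `DE`) simply collects several values, which is why each field is stored as a list.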
Important swissprot fields:
1)
2) cross references: hyperlinks to entries in all other databases that are related to the specific data in
Swissprot
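In the flat file, cross references are stored as lines that start with the keyword `DR`, followed by the name of the other database and the identifier of the related entry, separated by semicolons. A minimal sketch of reading them is given here; the identifiers in the example lines are invented and `parse_cross_references` is a hypothetical helper.

```python
def parse_cross_references(lines):
    """Return (database, identifier) pairs from DR-style lines."""
    refs = []
    for line in lines:
        if not line.startswith("DR"):
            continue                   # not a cross-reference line
        # Drop the keyword and the trailing period, then split on ';'
        parts = [p.strip() for p in line[2:].rstrip(".").split(";")]
        if len(parts) >= 2:
            refs.append((parts[0], parts[1]))
    return refs


# Invented example lines, for illustration only
lines = [
    "DR   EMBL; X00001; -.",
    "DR   PDB; 0XXX; X-ray.",
    "DE   Hypothetical example protein.",
]
refs = parse_cross_references(lines)
print(refs)
```

Each pair names a database and the entry to look up there, which is the information a web view turns into a hyperlink.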