Unit 4 – Year 2022/2023
Contents
Lecture 1: Data Management Systems, Relational, and SQL .............................................................. 2
Relational data model ............................................................................................................................ 3
Single table queries using SQL ............................................................................................................. 4
Lecture 2: Entity Relationships, and Translating from a Natural Language Perspective................ 5
Basic concepts ........................................................................................................................................ 5
Lecture 3: Translating ERD to DB Schema, and Database Normalization ....................................... 8
Transforming entity relationship diagram to relational schema ..................................................... 8
Normalization ........................................................................................................................................ 11
Lecture 4: Evolution of Data Management, Big Data, and Data Intensive Systems .................... 12
Evolution of Data Management ......................................................................................................... 12
Big data and its challenges .................................................................................................................. 12
Big data analytics .................................................................................................................................. 13
Reasons for going beyond traditional RDBMS ................................................................................ 13
Lecture 5: The Spark Ecosystem, RDDs, Programming Model, and PySpark ................................ 17
Lambda expressions ............................................................................................................................. 17
Apache Spark ........................................................................................................................................ 18
Programming models ........................................................................................................................... 19
Lecture 6: Data Transformations with SQL, Entity Recognition, Data Cleaning Tools ................ 20
Views ...................................................................................................................................................... 21
Functions ............................................................................................................................................... 21
Creating and populating ...................................................................................................................... 22
Data from websites, integration & cleaning, entity extraction & resolution .............................. 22
,Lecture 1: Data Management Systems, Relational, and SQL
Reasons for Database Management Systems (DBMS):
In the early days database applications were built on top of file systems,
Drawbacks:
• Data redundancy and inconsistency: multiple file formats, duplication in different files
• Difficulty in accessing data: need to write a new program to carry out each new task
• Data isolation: multiple files and formats
• Integrity problems: integrity constraints become buried in program code rather than being
stated explicitly
- Hard to add new constraints or change existing ones
• Atomicity of updates: failures leave data in an inconsistent state with partial updates carried
out
• Concurrent access by multiple users: needed for performance, uncontrolled concurrent
accesses can lead to inconsistencies
• Security problems: hard to provide user access to some, but not all, data
A database (DB) is a collection of data with the same structure, including correlations and
relationships, that is defined for a particular use and used by several users
A database management system (DBMS) is a collection of programs over DB that specify the
data types, structure, constraints, and store this on a disk, retrieve, update, and mange access
rights
• A black box interacting between users/applications and the database
• The ultimate goal is to separate data from application
- Provide an interface that the application programmer must follow
- Allow system administrator to make modifications without having an impact on the user
- Users can change their view of the data without having to worry about how it is stored
There are different layers within the DBMS:
• External layer: communication with users
- Analysis of user requests (queries)
- Access control
- Answer presentation
• Logical layer:
- Optimization of queries
- Resolving conflicting accesses, i.e., multiple users
- Guarantees constant availability even in case of failures
• Internal layer:
- Storing the data
- Software for structuring the data
- Efficient access methods, i.e., keys, indices, etc.
System development life cycle
1. Planning: develop a preliminary understanding of the business situation and how information
systems might help solve the problem
• Enterprise modeling: analyze current data processing, the general business functions and
their database needs, and justify need for new data and databases in support of business
• Conceptual data modeling: identify scope of database requirements for proposed
information system and analyze overall data requirements for business function(s) supported
by database
• Understand current data processing
• Understand general business functions and needs
, 2. Analysis: analyze the business situation thoroughly to determine requirements and to
structure those requirements
• Conceptual data modeling:
- Develop preliminary conceptual data model, including entities and relationships
- Compare to enterprise data model
- Develop detailed conceptual data model, including entities, relationships, attributes, and
business rules
- Make conceptual data model consistent with other models of information systems
- Populate repository with all conceptual database specifications
• Output: conceptual schema
• Corresponds to a detailed technology independent specification of the overall organizational
data structure
3. Design
• Logical: representation of the DB; transform the conceptual schema into a logic schema,
which describes the data in terms of the data management technology that will be used to
implement the database
• Physical: the set of specifications that describe how data from a logical schema are stored in
a computer’s secondary memory by a specific database management system
4. Implementation: a designer writes, tests, and installs the program/scripts that access, create,
or modify the database
• Finalize all database documentation, train users, and put procedures into place for the
ongoing support of the information system users
• Populate with data
• Install application(s) and test
• Complete documentation and training materials
5. Maintenance: add, delete, or change characteristics of the structure of the database in order
to meet changing business conditions, to correct errors in database design, or to improve the
processing speed of database applications
• Monitor the operation and usefulness of the system
• Repair by fixing errors in database and applications
• Enhance by analyzing the database and applications to ensure that evolving information
requirements are met
Different types of DBMS:
• Traditional DBMS: text and numerical data
• Multimedia DBMS: multimedia data
• Spatial DBMS: geographic and geometric data
• Data warehouses
Relational data model
A relational model is an approach to managing data by representing it grouped into relations
A relational DBMS (RDBMS) is a database management system that manages data as a collection
of tables in which all relationships are represented by common values in related tables