
Summary Introduction to Data Science Entire Course

Rating: -
Sold: 6
Pages: 18
Uploaded on: 21-03-2022
Written in: 2021/2022

Everything you need to know for the IDS exam!

INTRODUCTION TO DATA SCIENCE


Paper 2: Data Life Cycle (CRISP-DM)
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It provides a structured
approach to planning a data mining project. The model is a sequence of six phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment

STAGE 1: BUSINESS UNDERSTANDING
Understand what you want to accomplish from a business perspective.
• What are the desired outputs of the project?
• Assess the current situation:
  o List all resources, such as personnel, data, computing resources, and software.
  o List all requirements, assumptions, and constraints.
  o List all risks or events that might delay the project or cause it to fail.
  o Compile a glossary of terminology relevant to the project.
  o Construct a cost-benefit analysis for the project, which compares the costs of
    the project with the potential benefits to the business if it is successful.
• Determine data mining goals:
  o Business success criteria: describe the intended outputs of the project that
    enable the achievement of the business objectives.
  o Data mining success criteria: define the criteria for a successful outcome to the
    project in technical terms.
• Produce project plan:
  o Project plan: list the stages to be executed in the project, together with their
    duration, resources required, inputs, outputs, and dependencies.
  o Initial assessment of tools and techniques.

STAGE 2: DATA UNDERSTANDING
Acquire the data listed in the project resources.
• Collect the data: sources and methods.
• Describe the data: including its format, quantity, and the identities of the fields.
• Explore the data: visualize the data by looking at relationships between attributes,
  distribution of attributes, and simple statistical analyses.
• Verify data quality: is it complete and correct? Are there errors or missing values?
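The describe, explore, and verify steps above can be sketched with the Python standard library. This is only an illustration: the sample CSV, its field names, and the helper names `describe` and `numeric_summary` are assumptions made for the example and do not come from the course material.

```python
import csv
import io
import statistics

# Hypothetical sample data; an empty cell stands for a missing value.
RAW_CSV = """customer_id,age,country
1,34,NL
2,,DE
3,51,NL
4,29,
"""

def describe(csv_text):
    """Return the record count, field names, and missing-value count per field."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = list(rows[0].keys()) if rows else []
    missing = {f: sum(1 for r in rows if r[f] == "") for f in fields}
    return {"records": len(rows), "fields": fields, "missing": missing}

def numeric_summary(csv_text, field):
    """Simple statistics for one numeric field, skipping missing values."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(r[field]) for r in rows if r[field] != ""]
    return {"min": min(values), "max": max(values), "mean": statistics.mean(values)}

summary = describe(RAW_CSV)
age_stats = numeric_summary(RAW_CSV, "age")
print(summary)
print(age_stats)
```

The missing-value counts give a first impression of data quality before any cleaning is attempted.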

STAGE 3: DATA PREPARATION
• Select your data: decide on the data that you are going to use for analysis.
• Clean your data: raise the data quality to the level required by the analysis
  techniques that you have selected, for example by selecting clean subsets of data or
  handling missing data.
• Construct required data: derive new attributes or generate completely new records.
• Integrate data: merge and aggregate data.
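A minimal sketch of the clean, construct, and integrate steps, assuming records are held as plain Python dictionaries. The datasets, the field names, and the choice of median imputation are assumptions for illustration; the notes do not prescribe a particular way of handling missing data.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
customers = [
    {"customer_id": 1, "birth_year": 1988, "country": "NL"},
    {"customer_id": 2, "birth_year": None, "country": "DE"},
    {"customer_id": 3, "birth_year": 1971, "country": "NL"},
    {"customer_id": 4, "birth_year": 1990, "country": "BE"},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 3, "amount": 12.5},
]

def clean(records, field):
    """Handle missing data by filling gaps with the median of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    fill = median(known)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

def construct(records, reference_year):
    """Derive a new attribute (age) from an existing one (birth_year)."""
    return [{**r, "age": reference_year - r["birth_year"]} for r in records]

def integrate(records, orders):
    """Aggregate orders per customer and merge the totals into the records."""
    totals = {}
    for o in orders:
        totals[o["customer_id"]] = totals.get(o["customer_id"], 0.0) + o["amount"]
    return [{**r, "total_spent": totals.get(r["customer_id"], 0.0)} for r in records]

prepared = integrate(construct(clean(customers, "birth_year"), 2022), orders)
for row in prepared:
    print(row)
```

Each step returns new records instead of modifying its input, so the raw data stays available for comparison.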

STAGE 4: MODELLING
• Select the modelling technique, together with any modelling assumptions.
• Set up test and training sets.
• Build the model: list the parameter settings, the models produced, and the model
  descriptions.
• Assess the model: discuss the results with experts (considering the project goal) and revise
  the parameter settings, tuning them for the next modelling run.
Iterate model building and assessment until you strongly believe that you have found the
best model.
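The build, assess, and tune loop can be sketched as follows. Everything here is an assumption chosen for illustration: the synthetic one-dimensional data, the k-nearest-neighbours model, and the candidate values of k. The sketch also holds out a validation set from the training data for tuning, which the notes do not mention, so that the test set is only used once at the end.

```python
import random

def make_data(n, seed):
    """Hypothetical 1-D dataset: label is 1 when the noisy feature exceeds 5."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        label = 1 if x + rng.gauss(0, 1) > 5 else 0
        data.append((x, label))
    return data

def split(data, test_fraction, seed):
    """Shuffle a copy of the data and split it into (larger part, held-out part)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_held_out = int(len(shuffled) * test_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, held_out, k):
    """Share of held-out points that the model labels correctly."""
    hits = sum(1 for x, y in held_out if knn_predict(train, x, k) == y)
    return hits / len(held_out)

data = make_data(200, seed=0)
train, test = split(data, test_fraction=0.25, seed=1)
fit, validation = split(train, test_fraction=0.25, seed=2)

# One modelling run per parameter setting; keep every result for assessment.
runs = {k: accuracy(fit, validation, k) for k in (1, 3, 5, 9)}
best_k = max(runs, key=runs.get)

# Final check of the chosen setting on data not used for tuning.
test_accuracy = accuracy(train, test, best_k)
print(runs, best_k, test_accuracy)
```

Keeping the `runs` dictionary mirrors the advice to list the parameter settings and the models produced.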

STAGE 5: EVALUATION
• Evaluate your results: judge the quality of the model by taking business criteria into account,
  and approve the models that meet them.
• Review process: check whether the approved model fulfils the tasks and requirements.
• Determine next steps: list the possible actions and decide what to do.
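One way to make "approve the proper models" concrete is to translate the business criteria into metric thresholds and check each candidate model against them. The thresholds, model names, and predictions below are hypothetical, and precision and recall are just one possible choice of metrics.

```python
# Hypothetical business success criteria, expressed as minimum metric values.
CRITERIA = {"precision": 0.75, "recall": 0.60}

def metrics(actual, predicted):
    """Precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}

def approve(candidates, actual, criteria):
    """Keep only the models whose metrics meet every criterion."""
    approved = {}
    for name, predicted in candidates.items():
        m = metrics(actual, predicted)
        if all(m[key] >= threshold for key, threshold in criteria.items()):
            approved[name] = m
    return approved

actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
candidates = {
    "model_a": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    "model_b": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
approved = approve(candidates, actual, CRITERIA)
print(approved)
```

Here `model_b` has perfect precision but misses most positives, so it fails the recall criterion and is not approved.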

Paper 3: Principles of Data Wrangling
Structure of a dataset refers to the format and encoding of its records and fields.
• You want a rectangular dataset: a table with a fixed number of rows and columns.
• If the record fields in a dataset are not consistent (some records have additional
  fields, others are missing fields), then you have a jagged table.
• The encoding of the dataset specifies how the record fields are stored and presented
  to the user, for example which time zones are used for times.
• In many cases, it is advisable to encode a dataset in plain text, so that it is
  human-readable. Drawback: it takes up a lot of space.
• It is more efficient to use binary encodings of numerical values.
• Finding out the structure is mostly about counting the number of records and fields
  in the dataset and determining the dataset’s encoding.
• A few extra questions to ask yourself when assessing the structure of a dataset:
  o Do all records in the dataset contain the same fields?
  o How are the records delimited in the dataset?
  o What are the relationship types between records and the record fields?
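The structure checks above can be sketched in a few lines. The records, the helper names `field_profile` and `is_rectangular`, and the sample numbers are assumptions for illustration. The second half compares a plain-text encoding with a fixed-width binary one for the same values.

```python
import struct

# Hypothetical records; the second has an extra field and the third lacks one.
records = [
    {"id": 1, "name": "Ann", "city": "Amsterdam"},
    {"id": 2, "name": "Bob", "city": "Berlin", "phone": "555-0100"},
    {"id": 3, "name": "Cem"},
]

def field_profile(records):
    """Count how many records contain each field."""
    counts = {}
    for record in records:
        for field in record:
            counts[field] = counts.get(field, 0) + 1
    return counts

def is_rectangular(records):
    """True when every record has exactly the same set of fields."""
    field_sets = {frozenset(r) for r in records}
    return len(field_sets) <= 1

profile = field_profile(records)
rectangular = is_rectangular(records)
print(len(records), profile, rectangular)

# Plain-text versus binary encoding of the same numeric values.
values = [1234567.891, 0.000123456, 98765.4321]
as_text = ",".join(repr(v) for v in values).encode("utf-8")
as_binary = struct.pack(f"<{len(values)}d", *values)  # 8 bytes per 64-bit float
print(len(as_text), len(as_binary))
```

A field count lower than the record count, as for `city` and `phone` here, is the signature of a jagged table.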

Granularity of a dataset refers to the kinds of entities that each data record represents or
contains information about.
• In their most common form, records in a dataset will contain information about
  many instances of the same kind of entity (like a customer ID).
• We look at granularity in terms of coarseness and fineness: the level of depth or the
  number of distinct entities represented by a single record of your dataset.
• Fine: a single record represents a single entity (a single transaction at a store).
• Coarse: a single record represents multiple entities (sales per week per region).
• A few questions to ask yourself when assessing the data granularity:
  o What kind of things do the records represent? (Person, object, event, etc.)
  o What alternative interpretations of the records are there?
    • If the records are customers, could they actually be all known
      contacts (only some of which are customers)?
  o Example: one dataset records location as a country, while another records it as
    coordinates.
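The difference between fine and coarse granularity can be shown by aggregating transaction-level records into sales per week per region. The transactions and the helper name `coarsen` are hypothetical; weeks follow the ISO calendar as returned by `date.isocalendar()`.

```python
from collections import defaultdict
from datetime import date

# Fine-grained, hypothetical data: one record per transaction.
transactions = [
    {"date": date(2022, 3, 7), "region": "North", "amount": 10.0},
    {"date": date(2022, 3, 9), "region": "North", "amount": 15.0},
    {"date": date(2022, 3, 9), "region": "South", "amount": 7.5},
    {"date": date(2022, 3, 15), "region": "North", "amount": 20.0},
]

def coarsen(transactions):
    """Aggregate transactions into one record per (ISO year, ISO week, region)."""
    totals = defaultdict(float)
    for t in transactions:
        iso_year, iso_week, _ = t["date"].isocalendar()
        totals[(iso_year, iso_week, t["region"])] += t["amount"]
    return [
        {"year": y, "week": w, "region": r, "sales": s}
        for (y, w, r), s in sorted(totals.items())
    ]

weekly = coarsen(transactions)
for row in weekly:
    print(row)
```

Going from fine to coarse is always possible by aggregation, but the individual transactions cannot be recovered from the weekly totals.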

Accuracy of a dataset refers to its quality: the values populating record fields in the dataset
should be consistent and accurate.
• Common inaccuracies are misspellings of categorical variables, lack of appropriate
  categories, underflow and overflow of numerical values, and missing field components.
• A few questions to ask yourself when assessing the data accuracy:
  o Are the date-times specific? Are the address components consistent and correct?
    Are numeric items such as phone numbers complete?
  o Is the data entered by people? Manual entry increases the chance of misspellings.
  o Does the distribution of inaccuracies affect many records?
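Two of the accuracy checks above, misspelled categories and incomplete phone numbers, can be sketched with the standard library. The list of valid cities, the similarity cutoff of 0.8, and the assumption that a complete phone number has exactly 10 digits are all illustrative choices.

```python
import difflib
import re

VALID_CITIES = ["Amsterdam", "Rotterdam", "Utrecht"]  # hypothetical category list

def check_category(value, valid, cutoff=0.8):
    """Classify a categorical value as ok, a likely misspelling, or unknown."""
    if value in valid:
        return ("ok", value)
    close = difflib.get_close_matches(value, valid, n=1, cutoff=cutoff)
    if close:
        return ("misspelling", close[0])
    return ("unknown", None)

def phone_is_complete(phone, digits=10):
    """Assumes a complete phone number has exactly `digits` digits."""
    return len(re.sub(r"\D", "", phone)) == digits

records = [
    {"city": "Amsterdam", "phone": "020-555-0100"},
    {"city": "Amstredam", "phone": "020-555-01"},
    {"city": "Paris", "phone": "0205550199"},
]
report = [
    (check_category(r["city"], VALID_CITIES), phone_is_complete(r["phone"]))
    for r in records
]

# Share of records with at least one detected inaccuracy.
flagged = sum(1 for status, complete in report if status[0] != "ok" or not complete)
share_flagged = flagged / len(records)
print(report, share_flagged)
```

A value flagged as "unknown" may point to a missing category in the reference list and is not necessarily an error in the record.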

Temporality deals with how accurate and consistent the data is over time.
• Even when time is not explicitly represented in a dataset, it is still important to
  understand how time may have impacted the records in a dataset.
• Therefore, it is important to know when the dataset was generated.
• A few questions to ask yourself when assessing the data temporality:
  o Were all the records and record fields collected at the same time?
  o Have some records or record fields been modified after the time of creation?
  o In what ways can you determine if the data is stale?
  o Can you forecast when the values in the dataset might get stale?
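If records carry timestamps, the temporality questions above reduce to simple date arithmetic. The field names `created_at` and `updated_at`, the reference date, and the one-year staleness limit are assumptions for this sketch; many datasets will not have such fields.

```python
from datetime import datetime, timedelta

# Hypothetical records with creation and last-update timestamps.
records = [
    {"id": 1, "created_at": datetime(2021, 1, 10), "updated_at": datetime(2021, 1, 10)},
    {"id": 2, "created_at": datetime(2021, 6, 1), "updated_at": datetime(2022, 3, 1)},
    {"id": 3, "created_at": datetime(2022, 2, 20), "updated_at": datetime(2022, 2, 20)},
]

def collection_span(records):
    """Time between the earliest and latest record creation."""
    created = [r["created_at"] for r in records]
    return max(created) - min(created)

def modified_after_creation(records):
    """Ids of records that were changed after they were created."""
    return [r["id"] for r in records if r["updated_at"] > r["created_at"]]

def stale(records, now, max_age):
    """Ids of records whose last update is older than max_age."""
    return [r["id"] for r in records if now - r["updated_at"] > max_age]

now = datetime(2022, 3, 21)
span = collection_span(records)
modified = modified_after_creation(records)
stale_ids = stale(records, now, max_age=timedelta(days=365))
print(span, modified, stale_ids)
```

A large collection span is a sign that the records were not all collected at the same time.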

Scope of a dataset has 2 dimensions:
• 1) The number of distinct attributes represented in a dataset.
• 2) The attribute-by-attribute population coverage: are all the attributes for each field
  represented in the dataset, or have some been randomly, intentionally, or
  systematically excluded?
• The larger the scope, the larger the number of fields.
• As with granularity, you want to include only as much detail as you might use.
• A few questions to ask yourself when assessing the scope of your data:
  o Given the granularity, what characteristics of the things represented by the
    records are captured by the record fields? And what characteristics are not?
  o Are the record fields consistent? For example, does the customer’s age field make
    sense relative to the date-of-birth field?
  o Are the same record fields available for all records?
  o Are there multiple records for the same thing? If so, does this change
    granularity?
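The scope questions above can be turned into three small checks: field consistency (age against date of birth), field coverage, and duplicate records. The records, field names, and reference date are hypothetical and only serve to illustrate the checks.

```python
from collections import Counter
from datetime import date

# Hypothetical records; id 3 lacks a date of birth and id 1 appears twice.
records = [
    {"id": 1, "age": 30, "date_of_birth": date(1992, 1, 15)},
    {"id": 2, "age": 45, "date_of_birth": date(1980, 6, 30)},
    {"id": 3, "age": 22},
    {"id": 1, "age": 30, "date_of_birth": date(1992, 1, 15)},
]

def age_on(birth_date, on_date):
    """Age in whole years on a given date."""
    before_birthday = (on_date.month, on_date.day) < (birth_date.month, birth_date.day)
    return on_date.year - birth_date.year - before_birthday

def inconsistent_ages(records, on_date):
    """Ids whose age field disagrees with the date-of-birth field."""
    return sorted({
        r["id"] for r in records
        if "date_of_birth" in r and r["age"] != age_on(r["date_of_birth"], on_date)
    })

def field_coverage(records, fields):
    """Fraction of records in which each field is present."""
    return {f: sum(1 for r in records if f in r) / len(records) for f in fields}

def duplicate_ids(records):
    """Ids that occur in more than one record."""
    counts = Counter(r["id"] for r in records)
    return sorted(i for i, n in counts.items() if n > 1)

reference = date(2022, 3, 21)
bad_ages = inconsistent_ages(records, reference)
coverage = field_coverage(records, ["id", "age", "date_of_birth"])
duplicates = duplicate_ids(records)
print(bad_ages, coverage, duplicates)
```

Duplicates matter for granularity: if one customer can appear in several records, a record no longer stands for exactly one customer.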
Seller: femkestokkink, Vrije Universiteit Amsterdam