Summary

Summary Reading Material

Pages
34
Uploaded on
18-12-2019
Written in
2019/2020

For the course Introduction to Data Science, you get a lot of extra reading material (articles, papers, etc.). It helped me quite a bit to summarise (or at least make an overview of) this material. The test contains a considerable number of questions about it, so this summary is useful to read just before your exam.


Summary Reading Material
Introduction to Data Science

2019

LECTURE 1 - A TAXONOMY OF DATA SCIENCE

http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

A useful taxonomy for data science would be OSEMN: Obtain, Scrub, Explore, Model and iNterpret. Ideally, a
data scientist should be at home with them all.

OBTAIN

Part of the skillset of a data scientist is knowing how to obtain a sufficient corpus of usable data, possibly from
multiple sources. At the very least, one should know how to do this in a UN*X environment or in Python. One
should also be familiar with APIs (application programming interfaces).
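
As a minimal sketch of obtaining data from multiple sources in Python, the snippet below parses a JSON payload of the kind an API might return and a CSV export, then joins them on a shared key. The payloads, the field names (`user_id`, `name`, `visits`) and the helper `merge_sources` are invented for illustration; a real project would first fetch the JSON over HTTP.

```python
import csv
import io
import json

# Hypothetical payloads standing in for an API response and a CSV export.
API_RESPONSE = '[{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]'
CSV_EXPORT = "user_id,visits\n1,10\n2,3\n3,7\n"


def merge_sources(api_text, csv_text):
    """Join JSON records and CSV rows on user_id (inner join)."""
    users = {rec["user_id"]: rec for rec in json.loads(api_text)}
    merged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        uid = int(row["user_id"])  # CSV values arrive as strings
        if uid in users:  # keep only users present in both sources
            merged.append({**users[uid], "visits": int(row["visits"])})
    return merged


corpus = merge_sources(API_RESPONSE, CSV_EXPORT)
print(corpus)
```

User 3 appears only in the CSV export and is therefore dropped by the inner join.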

SCRUB

There will almost always be some amount of data cleaning (or scrubbing) necessary before analysis of these
data is possible. It is the least sexy part of the analysis process, but often the one that yields the greatest benefits. A
simple analysis of clean data can be more productive than a complex analysis of noisy and irregular data.
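
A small sketch of what scrubbing can look like in practice, assuming records arrive as dictionaries of strings. The sample rows, the field names and the `scrub` helper are made up for this example; real data will need its own rules.

```python
# Hypothetical raw records with typical problems: stray whitespace,
# inconsistent capitalisation, a duplicate and a missing measurement.
RAW_ROWS = [
    {"city": " Amsterdam ", "temp_c": "12.5"},
    {"city": "amsterdam", "temp_c": "12.5"},   # duplicate after normalisation
    {"city": "Utrecht", "temp_c": "n/a"},      # missing measurement
    {"city": "ROTTERDAM", "temp_c": "11"},
]


def scrub(rows):
    """Normalise text, parse numbers, drop unparseable rows and duplicates."""
    seen = set()
    clean = []
    for row in rows:
        city = row["city"].strip().title()
        try:
            temp = float(row["temp_c"])
        except ValueError:
            continue  # skip rows whose temperature cannot be parsed
        key = (city, temp)
        if key not in seen:
            seen.add(key)
            clean.append({"city": city, "temp_c": temp})
    return clean


clean_rows = scrub(RAW_ROWS)
print(clean_rows)
```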

EXPLORE

Visualizing (e.g. histograms and scatter plots), clustering, performing dimensionality reduction (e.g. PCA): these
are all part of ‘looking at data’. No hypothesis is being tested and no predictions are attempted. They are quite
useful for getting to know your data.
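
To illustrate ‘looking at data’ without testing anything, here is a rough sketch that prints a text histogram and two summary statistics using only the standard library. The scores and the bin width are invented for the example.

```python
import statistics
from collections import Counter

# Hypothetical sample: exam scores on a 0-100 scale.
scores = [52, 67, 71, 58, 90, 74, 66, 81, 49, 77, 63, 70]
BIN_WIDTH = 10


def histogram(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


bins = histogram(scores, BIN_WIDTH)
for start, count in bins.items():
    # One '#' per observation in the bin.
    print(f"{start:3d}-{start + BIN_WIDTH - 1:3d} | {'#' * count}")

print("mean:", statistics.mean(scores))
print("median:", statistics.median(scores))
```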

MODEL

Often, the ‘best’ model is the most predictive model. One can leave out a fraction of the data (the validation or
test set), learn/optimize a model using the remaining data (the learning or training set) by minimizing a chosen
loss function and evaluate this or another loss function on the validation data → cross validation. Models are
built to predict and to interpret. The former can be assessed quantitatively, the latter cannot.
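
The procedure above can be sketched as follows, assuming a simple linear model and squared error as the loss function. The synthetic data, the 75/25 split and the function names are choices made for this example only. It shows a single hold-out split; k-fold cross validation would repeat the same steps over several different splits.

```python
import random


def fit_line(xs, ys):
    """Closed-form least squares for y = a + b*x (minimises squared loss)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(params, xs, ys):
    """Mean squared error of the fitted line on the given data."""
    intercept, slope = params
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)  # fixed seed so the split is reproducible
# Synthetic data: y = 2 + 3x plus Gaussian noise (invented for the example).
data = [(x, 2 + 3 * x + rng.gauss(0, 1)) for x in range(40)]
rng.shuffle(data)

split = int(0.75 * len(data))  # 75% training, 25% validation
train, valid = data[:split], data[split:]

params = fit_line(*zip(*train))  # learn on the training set only
print("intercept, slope:", params)
print("training MSE:  ", mse(params, *zip(*train)))
print("validation MSE:", mse(params, *zip(*valid)))
```

The validation error is the quantity that estimates how well the model predicts data it has not seen.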

INTERPRET

The predictive power of a model lies in its ability to generalize in the quantitative sense: to make accurate
quantitative predictions of data in new experiments. The interpretability of a model lies in its ability to
generalize in the qualitative sense: to suggest to the modeler which would be the most interesting experiments
to perform next.

CONCLUSION

Data science is clearly a blend of the hackers’ arts (primarily in steps “O” and “S” above); statistics and machine
learning (primarily steps “E” and “M” above); and the expertise in mathematics and the domain of the data for
the analysis to be interpretable.





LECTURE 1 - THE DATA SCIENCE VENN DIAGRAM

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

The difficulty in defining these skills is that the split between substance and methodology is ambiguous, and as
such it is unclear how to distinguish among hackers, statisticians, subject matter experts, their overlaps and
where data science fits. It is clear, however, that one needs to learn a lot as they aspire to become a fully
competent data scientist.




HOW TO READ THE DATA SCIENCE VENN DIAGRAM

• Data science is interdisciplinary. Hacking skills, math & stats knowledge and substantive
expertise are each very valuable on their own, but when combined with only one other are at best simply
not data science, or at worst downright dangerous.
• Hacking skills: Data is a commodity traded electronically. Hence, it is handy to “speak hacker”. Being
able to manipulate text files at the command-line, understanding vectorized operations and thinking
algorithmically are the hacking skills that make for a successful data hacker.
• Math & Statistics Knowledge: Having acquired and cleaned the data, one should look for insights.
For this, you need to apply appropriate math and statistical methods.
• Substantive Expertise: Science is about discovery and building knowledge, which requires some
motivating questions about the world and hypotheses that can be brought to data and tested with
statistical methods.
• Danger zone (hacking skills plus substantive expertise, without math & stats knowledge): people who
can run a linear regression, but do not know what the coefficients mean.
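
One way to read the diagram is as a lookup from a combination of skills to the region where they overlap. The sketch below encodes that idea; the region labels follow Drew Conway's diagram, while the skill keys and the `venn_region` function are made up for this illustration.

```python
# Overlap regions of the three skill circles (labels from the diagram).
REGIONS = {
    frozenset({"hacking", "math_stats"}): "machine learning",
    frozenset({"math_stats", "substantive"}): "traditional research",
    frozenset({"hacking", "substantive"}): "danger zone",
    frozenset({"hacking", "math_stats", "substantive"}): "data science",
}


def venn_region(skills):
    """Return the overlap region for a set of skills, or None if the
    set is not one of the overlaps (for example a single skill)."""
    return REGIONS.get(frozenset(skills))


print(venn_region({"hacking", "substantive"}))
print(venn_region({"hacking", "math_stats", "substantive"}))
```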





LECTURE 2 - WHAT IS THE CRISP-DM METHODOLOGY?

https://www.sv-europe.com/crisp-dm-methodology/#dataunderstanding

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. This methodology provides a structured
approach to planning a data mining project. The model is an idealised sequence of events. In practice, many of
the tasks can be performed in a different order and it will often be necessary to backtrack to previous tasks and
repeat certain actions.
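
The difference between the idealised sequence and what happens in practice can be sketched by logging the phases of a project and flagging every step back. The six phase names are the standard CRISP-DM phases, which go beyond the stage covered in this summary; the project log itself is hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The six CRISP-DM phases in their idealised order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Hypothetical project log: modelling reveals a data problem,
# so the team returns to data preparation before continuing.
log = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.DATA_PREPARATION,
    Phase.MODELING,
    Phase.EVALUATION,
    Phase.DEPLOYMENT,
]

# A backtrack is any transition to an earlier phase.
backtracks = [(a.name, b.name) for a, b in zip(log, log[1:]) if b < a]
print(backtracks)
```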




STAGE 1: DETERMINE BUSINESS OBJECTIVES


WHAT ARE THE DESIRED OUTPUTS OF THE PROJECT?
1. Set objectives. This means describing your primary objective from a business perspective.
2. Produce project plan. The plan should specify the steps to be performed during the rest of the project,
including the initial selection of tools and techniques.
3. Business success criteria. Here you’ll lay out the criteria that you’ll use to determine whether the project
has been successful from the business point of view. → Specific & measurable.


ASSESS THE CURRENT SITUATION
1. Inventory of resources → personnel, data, computing resources and software.
2. Requirements, assumptions and constraints → e.g. the GDPR and constraints on the availability of
resources.
3. Risks and contingencies → risks that might delay the project.
4. Terminology → compile a glossary of terminology relevant to the project.
5. Costs and benefits → financial measures in a commercial situation.


DETERMINE DATA MINING GOALS
1. Business success criteria → states objectives in business terminology. Describe the intended outputs of
the project that enable the achievement of the business objectives.
2. Data mining success criteria → states project objectives in technical terms, for example: a certain level of
predictive accuracy.
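
A data mining success criterion of this kind is specific and measurable, so it can be checked mechanically. The sketch below assumes a classification task with accuracy as the metric; the 0.80 target and the labels are invented for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


TARGET_ACCURACY = 0.80  # hypothetical threshold agreed at the start of the project

# Invented true labels and model predictions for a binary outcome.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

achieved = accuracy(y_true, y_pred)
criterion_met = achieved >= TARGET_ACCURACY
print(f"accuracy = {achieved:.2f}, criterion met: {criterion_met}")
```

Fixing the threshold before modelling starts keeps the success criterion from drifting towards whatever the model happens to achieve.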


PRODUCE PROJECT PLAN



Seller: berendmarkhorst, St Ignatiusgymnasium (Amsterdam)