The Data Analytics Journey D204 (for the WGU MSDA new path)

Data preparation: Data preparation takes about 80% of the time; everything else falls into the remaining 20%.

GIGO: Garbage in, garbage out. A truism from computer science: the information you get from your analysis is only as good as the information you put into it.

Upside to in-house data: It is the fastest way to start, and you may be able to talk with the people who gathered the data in the first place.

Downside to in-house data: If it came from an ad hoc project, it may not be well documented. The biggest downside is that the data may simply not exist; what you need may not be in your organization at all.

Open data: Data that is free, both in cost and in the right to use it, and that you can integrate into your projects. Main sources: (1) government data, (2) scientific data, and (3) data from social media and tech companies.

APIs: An API (Application Programming Interface) is not a source of data but a way of sharing it; it moves data from one application to another, typically as JSON. (A short Python sketch of pulling JSON from an API follows this group of notes.)

Scraping data: Data scraping is, in a sense, the found art of data science: you take the data around you, such as tables on web pages and graphs in newspapers, and integrate that information into your data science work. Unlike data available through APIs, which is specifically designed for sharing, scraped data is not necessarily created with that integration in mind.

Scraping data and ethics: Legal and ethical constraints still apply. Respect people's privacy: if the data is private, you must keep it private. Respect copyright: just because something is on the web does not mean you can use it however you want. The principle is "visible doesn't mean open": as in an open market, the fact that something is in front of you without a price tag does not mean it is free. Laws, policies, and social practices still have to be respected, or data scraping can get you into serious trouble.

Creating data / getting your own data: Natural observation; informal discussions with, for instance, potential clients, either in person (one-on-one or in a focus group) or online (by email or chat), asking specific questions to get the information you need to focus your projects; and surveys. Favor words over numbers, let people express themselves, and start general.

Research ethics when gathering data: Informed consent, and sometimes confidentiality or anonymity.

Passive collection of training data: Gathering enormous amounts of data does not always involve enormous amounts of work; in certain respects you can just wait for it to come to you, as with photo classification. One huge issue is ensuring that you have adequate representation in what you collect, for example when categorizing photos, along with handling limit cases.

Self-generated data: Data a system creates for itself, for example through external reinforcement or, internally, with generative adversarial networks.

The enumeration of explicit rules: Writing the rules out directly, as in business strategies, flowcharts, or criteria for medical diagnoses.

Expert system: An approach to machine decision-making in which algorithms are designed to mimic the decision-making process of a human domain expert.
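As a minimal sketch of the API idea above, assuming the requests library is installed, pulling a JSON payload into Python might look like this; the URL and parameters are placeholders for illustration, not endpoints from the notes:

    # Minimal sketch: pull JSON from a web API with the requests library.
    # The URL and parameters below are placeholders, not a real endpoint.
    import requests

    def fetch_json(url, params=None):
        """Return the parsed JSON body of a GET request, raising on HTTP errors."""
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()   # stop here if the server returned an error code
        return response.json()        # JSON text -> Python dicts and lists

    if __name__ == "__main__":
        records = fetch_json("https://example.org/api/records", params={"limit": 5})
        print(type(records))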
Linear regression: A common and powerful technique for combining many variables in one equation to predict a single outcome.

Decision tree: A sequence of binary decisions, based on your data, that combine to predict an outcome. It is called a tree because it branches out from one decision to the next.

Neural networks: Look at things differently than humans do; in certain situations they can develop rules for classification even when humans can't see anything more than static.

Implicit rules: The rules that help the algorithms function, developed by analyzing the data. They are implicit because they cannot be easily described to humans.

Spreadsheets: Microsoft Excel (in its many versions) and Google Sheets. The universal data tool, and by far the most common: it is an untested theory, but there are arguably more datasets in spreadsheets than in any other format in the world, and the rows and columns are familiar to a very large number of people who know how to explore and access data with them.

MLaaS: Machine learning as a service, such as Amazon Machine Learning, Google AutoML, and IBM Watson Analytics.

Algebra: First, it allows you to scale up: the solution you create should deal efficiently with many instances at once (create it once, run it many times). Second, and closely related, it lets you generalize: the solution should not apply only to a few specific cases with so-called magic numbers, but to cases that vary in a wide range of arbitrary ways, so you prepare for as many contingencies as possible.

Calculus: Used for maximization and minimization, when you are trying to find the balance between disparate demands.

Optimization and the combinatorial explosion: You are trying to find an optimal solution, but randomly going through every possibility does not work. It is called the combinatorial explosion because the number of possibilities grows explosively as the number of units rises, so you need another approach that saves time and still finds an optimal solution.

Bayes' theorem: Gives you the posterior (after-the-data) probability of a hypothesis as a function of the likelihood of the data given the hypothesis, the prior probability of the hypothesis, and the probability of getting the data you found. In symbols: P(H|D) = P(D|H) * P(H) / P(D).

Descriptive analyses: A little like cleaning up the mess in your data to find clarity in the meaning of what you have. Three very general steps: (1) visualize your data, make a graph and look at it; (2) compute univariate descriptive statistics, such as the mean, an easy way of looking at one variable at a time; (3) compute measures of association, the connections between the variables in your data. (A short Python sketch of these three steps follows this group of notes.)

Steps for descriptive analyses: Start by looking at your data through charts, e.g., a histogram.

Skews: Positively skewed distributions have most values at the low end with a long tail toward the high end; think of company valuations or house prices. Negatively skewed distributions have most values at the high end with the trailing ones at the low end; think of birth weight.
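A minimal Python sketch of the three descriptive steps above, using simulated, positively skewed data in the spirit of the house-price example; the numbers and variable names are illustrative, not from the notes:

    # Minimal sketch of the three descriptive steps: visualize, summarize one
    # variable at a time, then measure association. All data here is simulated.
    import numpy as np

    rng = np.random.default_rng(42)
    prices = rng.lognormal(mean=12, sigma=0.5, size=1000)    # positively skewed, like house prices
    sizes = 0.002 * prices + rng.normal(0, 100, size=1000)   # a second, loosely related variable

    # Step 1: visualize (a crude text histogram keeps the sketch dependency-free)
    counts, edges = np.histogram(prices, bins=10)
    for count, left_edge in zip(counts, edges):
        print(f"{left_edge:>12,.0f} | {'#' * (count // 10)}")

    # Step 2: univariate descriptives; mean > median hints at the positive skew
    print("mean:", round(prices.mean()), "median:", round(np.median(prices)))

    # Step 3: a measure of association between the two variables
    print("correlation:", round(float(np.corrcoef(prices, sizes)[0, 1]), 2))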
U-shaped distribution: Think of a polarizing movie and the reviews it gets.

Univariate descriptives: You look for one number that might represent the entire collection.

Measures that give a numerical description of association: The correlation coefficient or regression analysis.

Predictive models: Find and use relevant past data. It does not have to be old (it can be data from yesterday), but it always has to be data from the past, because that is the only data you can get. Then model the outcome using any of many possible choices.

Predictive model critical step: Validate your model by testing it against new data, often data that has been set aside for this very purpose. This step is often neglected in scientific research, but it is nearly universal in predictive analytics and is critical to making sure your model works well outside the constraints of the data you had available.

Predictive analytics: One use is predicting future events, using presently available data to predict something that will happen later, such as using past medical records to predict future health. The other, possibly more common, use is prediction of alternative events, that is, approximating how a human would perform the same task, for example having a machine classify photos and checking whether a photo shows a person.

Decomposition: Breaking the whole down into its elements to see what is happening with your data over time, like disassembling a clock. You take the trend over time and break it into several separate elements: the overall trend, a seasonal or cyclical component, and leftover random noise.

Clustering: Locate each data point, each observation, in a multidimensional space with K dimensions for K variables (five variables means five dimensions; 500 variables means 500 dimensions). Then measure the distance from each point to every other point, looking for clumps and gaps. (A short Python sketch of clustering and classifying follows this group of notes.)

Cluster analysis methods: Hierarchical clustering; K-means (a group-centroid model); density models; distribution models; linkage clustering models.

Classifying: Locate the case in a k-dimensional space, where k is the number of variables or kinds of information you have (probably more than three, possibly hundreds or thousands). Compare the labels on nearby data, assuming that data already has labels saying, for example, whether it is a photo of a cat, a dog, or a building. Then assign the new case to the same category. LOCATE, COMPARE, ASSIGN.

Classifying methods: K-means, k-nearest neighbors, binary classification, many categories, distance measures.

Anomaly detection: Anomalies are cases that are distant from the others in a multidimensional space, cases that do not follow an expected pattern or trend over time, or, in the case of fraud, cases that match known anomalies or other fraudulent cases.
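The clustering and classifying notes above both work by locating points in a k-dimensional space. A minimal scikit-learn sketch on simulated two-dimensional data; the cluster count and neighbor count are illustrative choices, not values from the notes:

    # Minimal sketch: K-means clustering (find clumps with no labels) and
    # k-nearest-neighbors classification (locate, compare, assign). Simulated data.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    # Two clumps of points in a 2-dimensional space (k = 2 variables here)
    group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
    group_b = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))
    points = np.vstack([group_a, group_b])

    # Clustering: no labels given; K-means looks for two clumps on its own
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    print("cluster sizes:", np.bincount(clusters))

    # Classifying: labels are known for existing points; a new case is assigned
    # to the category of its nearest labeled neighbors (locate, compare, assign)
    labels = np.array([0] * 50 + [1] * 50)
    knn = KNeighborsClassifier(n_neighbors=5).fit(points, labels)
    new_case = np.array([[2.8, 3.1]])
    print("new case assigned to category:", knn.predict(new_case)[0])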
Dimensionality reduction: Reduce the number of variables and the amount of data you are dealing with. Each variable, factor, or feature has error associated with it: it does not measure exactly what you want and brings in some other stuff. But when you combine many variables or features, the errors tend to cancel out.

Dimensionality reduction methods: The first is principal component analysis, often just called principal components or PCA: you take multiple correlated variables and combine them into a single component score. Another very common approach is factor analysis, which functionally works the same way and is used for the same purposes, although the philosophy behind it is very different: the goal is to find the underlying common factor that gives rise to multiple indicators.

PCA vs. FA: In principal component analysis, the variables come first and the component results from them. In factor analysis, the hidden factor comes first and gives rise to the individual variables. (A short Python sketch of PCA and cross-validation follows this group of notes.)

Feature selection and creation: Dimension reduction is often used as part of getting the data ready so you can then look at which features to include in the models you are creating.

Feature selection and creation methods: Regression-based approaches such as stepwise regression and ridge regression.

Validating models: Check your work. The basic principle is simple, even if people outside data science don't do it very often.

Validating models, methods: Split your data into two groups, training data and testing data. Cross-validation goes further: you split the training data into several pieces, say six groups, use five at a time to build a model, and use the sixth group to test it; then you rotate through a different set of five and verify against a different one-sixth of the data, and so on. Each held-out piece acts as testing data even though it comes from the training data.

Interpretability: The point of all this is that interpretability is critical in your analysis. You are telling a story, and you need to be able to make sense of your findings so you can make reasonable and justifiable recommendations.

Part-to-whole charts: Pie charts and bar charts.

Distribution charts: Frequency tables.

Predictive analytics: Tells you which future events are the most likely.

Why big data projects fail: Poor organization is the biggest factor. If you don't have access to all the data you want, the project is often an effective way to figure out a better or more complete set of data to collect in the future. Often more data will come in as you are analyzing, and your question must change to reflect the addition. Sometimes the data isn't clean, which may put a time crunch on your project if you don't take that into account.
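A minimal scikit-learn sketch of two of the ideas above: principal component analysis to reduce many correlated variables to a few component scores, and k-fold cross-validation to validate a model. The simulated data, the two-component choice, and the five folds are illustrative assumptions, not values from the notes:

    # Minimal sketch: PCA combines correlated variables into fewer component
    # scores; k-fold cross-validation rotates through held-out pieces of the
    # training data to check the model. All data here is simulated.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    signal = rng.normal(size=(200, 1))
    # Ten correlated features that all echo the same underlying signal plus noise
    X = signal + rng.normal(scale=0.3, size=(200, 10))
    y = signal.ravel() * 2.0 + rng.normal(scale=0.2, size=200)

    # Dimensionality reduction: 10 correlated variables -> 2 component scores
    components = PCA(n_components=2).fit_transform(X)
    print("reduced shape:", components.shape)

    # Validation: 5-fold cross-validation of a simple linear regression model
    scores = cross_val_score(LinearRegression(), components, y, cv=5)
    print("R^2 per fold:", np.round(scores, 3))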
