Summary

Summary Big data Description assignment

Pages
20
Uploaded on
14-08-2023
Written in
2023/2024

Summary of 20 pages for the course  CTEC2921_2223_502 Big Data and Machine Learning at DMU (N/a)











Table of Contents
INTRODUCTION
Details of the approach
Preparing Data
Data Visualization
Selection
Random Forest
Results Analysis
Discussion and conclusions

INTRODUCTION
The Titanic disaster, which resulted in the loss of many lives, is a widely recognized historical
event. This project applies classification techniques to the Titanic dataset to predict
passenger survival. To achieve this goal, we divide the work into smaller tasks: data
pre-processing, cleaning, normalization, visualization, and feature extraction/selection.
Finally, we use classification models to predict survival.

Details of the approach:
Our project uses the Python programming language and established libraries such as
pandas, NumPy, Matplotlib, and scikit-learn. First, we import the Titanic dataset and
perform data pre-processing and cleaning to remove duplicates and handle missing data. The
data is then normalized so that each feature has a comparable scale, which can improve the
performance of classification models that are sensitive to feature scale. Next, we conduct
data analysis and visualization to gain insight into the data and identify patterns that might
help predict survival. Following that, we apply feature extraction and selection to identify
the characteristics most strongly correlated with the target variable.
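
The cleaning, normalization, and correlation steps described above can be sketched on a small made-up table. This is a minimal illustration, not the report's own code: the column names (Age, Fare, Survived) and values are assumptions, and min-max scaling is only one possible choice of normalization.

```python
import pandas as pd

# Hypothetical toy data; the real Titanic columns and values may differ.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 38.0, 80.0],
    "Fare": [7.25, 71.28, 8.05, 71.28, 30.0],
    "Survived": [0, 1, 1, 1, 0],
})

# Drop exact duplicate rows, then drop rows with any missing value.
clean = toy.drop_duplicates().dropna().reset_index(drop=True)

# Min-max normalization: rescale each feature column to the range [0, 1].
features = ["Age", "Fare"]
col_min = clean[features].min()
col_max = clean[features].max()
normalized = clean.copy()
normalized[features] = (clean[features] - col_min) / (col_max - col_min)

# Correlation of each feature with the target, a simple relevance signal.
relevance = normalized[features].corrwith(normalized["Survived"])
print(normalized)
print(relevance)
```

Dropping rows with missing values is the simplest option; imputing them (for example with the column median) would keep more rows at the cost of introducing estimated values.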

Preparing Data
import pandas as pd

df = pd.read_csv("/kaggle/input/titanicdata/TitanicData.csv")

The code above uses the pandas library to read the CSV file at
"/kaggle/input/titanicdata/TitanicData.csv" and stores its contents in a pandas
DataFrame object named df. The CSV file is likely to contain the training set of the
Titanic dataset, which includes information on the passengers aboard the Titanic.
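
Because the Kaggle path only exists inside that environment, the same call can be tried on an in-memory CSV. The sketch below is illustrative only: the column names and rows are assumptions, not the contents of TitanicData.csv.

```python
import io
import pandas as pd

# Hypothetical sample rows; the column names are assumed for illustration.
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # Age is read as float64 because of the empty cell
```

The empty Age cell in the last row is read as NaN, which is what the missing-value checks later in this section detect.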

df.describe()

This snippet calls the describe() method on the DataFrame df. The method generates
summary statistics for the numerical columns in the DataFrame: count, mean, standard
deviation, minimum, the 25th, 50th, and 75th percentiles, and maximum. The output is a
table with one column per numerical column of df and one row per statistic.
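
A small sketch of what describe() returns, using made-up values. The column names here are assumptions chosen to resemble Titanic data; the text column is included to show that non-numerical columns are left out by default.

```python
import pandas as pd

# Hypothetical values; "Sex" is a text column and is skipped by describe().
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

summary = sample.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["mean", "Age"])
```

Passing include="all" to describe() would also summarize the text column, with statistics such as the number of unique values.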




Together, read_csv() and describe() load and summarize the Titanic dataset, providing
statistical information on the numerical characteristics of the dataset's training set.



df.isnull()

This code generates a boolean DataFrame with the same shape as df, where True
indicates that the cell is missing (NaN or None) and False indicates that the cell
contains a value.

df.isnull().sum()

This code tallies the number of null values in each column of a pandas DataFrame. This
information can be useful in identifying which columns have null values and how many null
values there are in each column.
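
The two missing-value checks can be seen side by side on a small made-up table. The column names and values below are assumptions for illustration, not figures from the Titanic dataset.

```python
import pandas as pd

# Hypothetical rows with a few deliberately missing cells.
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 7.92],
})

# Boolean mask with the same shape as the data: True marks a missing cell.
mask = sample.isnull()

# Summing the mask counts True values per column (True counts as 1).
missing_per_column = mask.sum()

# The mean of the mask gives the fraction of missing values per column.
missing_share = mask.mean()

print(mask)
print(missing_per_column)
print(missing_share)
```

Looking at the share as well as the count helps decide whether a column is worth imputing or is better dropped.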



