DATA SCIEN Predictive Modeling Project
In [205]:
# from sklearn.datasets import load_boston  # unused below; removed in scikit-learn 1.2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error
from statsmodels.stats.outliers_influence import variance_inflation_factor
import math

1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA.) Perform univariate and bivariate analysis.

In [88]:
df = pd.read_csv("cubic_")  # file name truncated in the source
df.head()

Out[88]:
   Unnamed: 0  carat  cut        color  clarity  depth  table  x     y     z     price
0  1           0.30   Ideal      E      SI1      62.1   58.0   4.27  4.29  2.66  499
1  2           0.33   Premium    G      IF       60.8   58.0   4.42  4.46  2.70  984
2  3           0.90   Very Good  E      VVS2     62.2   60.0   6.04  6.12  3.78  6289
3  4           0.42   Ideal      F      VS1      61.6   56.0   4.82  4.80  2.96  1082
4  5           0.31   Ideal      F      VVS1     60.4   59.0   4.35  4.43  2.65  779

In [89]:
df.tail()

Out[89]:
   Unnamed: 0  carat  cut        color  clarity  depth  table  x     y     z     price
…  …           ….11   Premium    G      SI1      62.3   58.0   6.61  6.52  4.09  5408
…  …           ….33   Ideal      H      IF       61.9   55.0   4.44  4.42  2.74  1114
…  …           ….51   Premium    E      VS2      61.7   58.0   5.12  5.15  3.17  1656
…  …           ….27   Very Good  F      VVS2     61.8   56.0   4.19  4.20  2.60  682
…  …           ….25   Premium    J      SI1      62.0   58.0   6.90  6.88  4.27  5166

In [90]:
df.shape

Out[90]: (26967, 11)

In [91]:
df.describe(include="all").transpose()

Out[91]:
            count  unique  top    freq   mean     std      min   25%     50%   75%   max
Unnamed: 0  26967  NaN     NaN    NaN    …        ….85     1     6742.…  …     ….5   26967
carat       26967  NaN     NaN    NaN    0.…      0.…      0.2   0.4     0.7   1.05  4.5
cut         26967  5       Ideal  10816  NaN      NaN      NaN   NaN     NaN   NaN   NaN
color       26967  7       G      5661   NaN      NaN      NaN   NaN     NaN   NaN   NaN
clarity     26967  8       SI1    6571   NaN      NaN      NaN   NaN     NaN   NaN   NaN
depth       26270  NaN     NaN    NaN    61.7451  1.41286  50.8  61      61.8  62.5  73.6
table       26967  NaN     NaN    NaN    57.4561  2.…      …     …       …     59    79
x           26967  NaN     NaN    NaN    5.72985  1.12852  0     4.71    5.69  6.55  10.23
y           26967  NaN     NaN    NaN    5.73357  1.16606  0     4.71    5.71  6.54  58.9
z           26967  NaN     NaN    NaN    3.53806  0.…      0     2.9     3.52  4.04  31.8
price       26967  NaN     NaN    NaN    3939.52  4024.…   …     …       …     …     18818

In [92]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26967 entries, 0 to 26966
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  26967 non-null  int64
 1   carat       26967 non-null  float64
 2   cut         26967 non-null  object
 3   color       26967 non-null  object
 4   clarity     26967 non-null  object
 5   depth       26270 non-null  float64
 6   table       26967 non-null  float64
 7   x           26967 non-null  float64
 8   y           26967 non-null  float64
 9   z           26967 non-null  float64
 10  price       26967 non-null  int64
dtypes: float64(6), int64(2), object(3)
memory usage: 2.3+ MB

In [93]:
df.isnull().sum()

Out[93]:
Unnamed: 0      0
carat           0
cut             0
color           0
clarity         0
depth         697
table           0
x               0
y               0
z               0
price           0
dtype: int64

In [94]:
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 0

In [95]:
# Print the number of levels and the level counts for every categorical column
for column in df.columns:
    if df[column].dtype == 'object':
        print(column.upper(), ': ', df[column].nunique())
        print(df[column].value_counts().sort_values())
        print('\n')

CUT :  5
Fair           781
Good          2441
Very Good     6030
Premium       6899
Ideal        10816
Name: cut, dtype: int64

COLOR :  7
J    1443
I    2771
D    3344
H    4102
F    4729
E    4917
G    5661
Name: color, dtype: int64

CLARITY :  8
I1       365
IF       894
VVS1    1839
VVS2    2531
VS1     4093
SI2     4575
VS2     6099
SI1     6571
Name: clarity, dtype: int64

In [96]:
# sns.distplot is deprecated since seaborn 0.11; kept here as in the original notebook
sns.distplot(df['carat'], color='black', rug=True)

Out[96]: <matplotlib.axes._subplots.AxesSubplot at 0xca0>

In [97]:
sns.distplot(df['depth'], color='black', rug=True)

Out[97]: <matplotlib.axes._subplots.AxesSubplot at 0x70>

In [98]:
sns.distplot(df['table'], color='black', rug=True)

Out[98]: <matplotlib.axes._subplots.AxesSubplot at 0x16416b42340>

In [99]:
sns.distplot(df['price'], color='black', rug=True)

Out[99]: <matplotlib.axes._subplots.AxesSubplot at 0x50>

In [100]:
sns.distplot(df['x'], color='black', rug=True)

In [101]:
sns.distplot(df['y'], color='black', rug=True)

Out[100]: <matplotl…
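The null, duplicate, and summary checks in cells 90 to 94 can be bundled into one helper. The sketch below is an illustration and not part of the original notebook: the function name `summarize_quality` and the toy DataFrame are hypothetical. It also counts rows where `x`, `y`, or `z` equals 0, on the assumption that a zero dimension is a data-entry problem, which the `min` of 0 in the describe output suggests but the notebook excerpt does not state.

```python
import numpy as np
import pandas as pd


def summarize_quality(df: pd.DataFrame, dim_cols=("x", "y", "z")) -> dict:
    """Collect basic data-quality checks (hypothetical helper, not from the notebook)."""
    null_counts = df.isnull().sum()
    # Only look at dimension columns that actually exist in the frame
    present = [c for c in dim_cols if c in df.columns]
    zero_rows = int((df[present] == 0).any(axis=1).sum()) if present else 0
    return {
        "shape": df.shape,
        # Keep only the columns that have at least one missing value
        "nulls": null_counts[null_counts > 0].to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        # Assumption: a zero in any dimension column marks a suspect row
        "zero_dimension_rows": zero_rows,
    }


# Tiny made-up frame for illustration only; these are not rows of the real dataset
toy = pd.DataFrame({
    "carat": [0.30, 0.33, 0.90, 0.90],
    "depth": [62.1, np.nan, 62.2, 62.2],
    "x": [4.27, 0.0, 6.04, 6.04],
    "y": [4.29, 4.46, 6.12, 6.12],
    "z": [2.66, 2.70, 3.78, 3.78],
})

report = summarize_quality(toy)
print(report)
```

On the real data, calling `summarize_quality(df)` should agree with the outputs above: shape `(26967, 11)`, 697 nulls in `depth`, and 0 duplicate rows.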
Written for
- Institution: Great Lakes Maritime Academy
- Course: DATA SCIEN
Document information
- Uploaded on: March 16, 2023
- Number of pages: 82
- Written in: 2022/2023
- Type: Exam (elaborations)
- Contains: Unknown
Subjects
- data scien
- predictive modeling project