Ans 1: NumPy is a library for n-dimensional arrays and numerical methods. Pandas is a library for processing data, particularly data in tabular (row-and-column) format. Seaborn is a statistical plotting library built on top of Matplotlib. Matplotlib.pyplot is the Matplotlib module for generating plots and visuals.
Ans 3: There are 9 columns.
Ans 5: The dimensions of the dataset are obtained via the shape attribute, which returns the number of rows and columns as a (rows, columns) tuple.
Ans 6: The size of the dataset is the total number of elements (cells) in it, computed as rows * columns.
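A minimal sketch of the shape and size checks from Ans 5 and Ans 6. The dataframe `toy` and its values are made up for illustration and are not taken from the Pima data.

```python
import pandas as pd

# Hypothetical stand-in for the Pima dataframe: 3 rows, 2 columns.
toy = pd.DataFrame({"Glucose": [110, 95, 170], "BMI": [30.1, 24.5, 38.2]})

rows, cols = toy.shape   # shape is an attribute, so no parentheses
n_elements = toy.size    # total number of cells, equal to rows * cols

print(rows, cols, n_elements)  # 3 2 6
```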
Ans 7: There are a total of 9 variables (1 per column): Two of them are float type: BMI and DiabetesPedigreeFunction. Seven of them are integer type: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, Age and Outcome.
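One way to list columns by type, sketched on a made-up dataframe (`toy` and its values are hypothetical). On the real data, `pima.info()` or `pima.dtypes` gives the same information.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with two integer columns and one float column.
toy = pd.DataFrame({
    "Pregnancies": [1, 0, 3],
    "Age": [22, 31, 45],
    "BMI": [30.1, 24.5, 38.2],
})

# Select column names by their NumPy dtype family.
float_cols = toy.select_dtypes(include=[np.floating]).columns.tolist()
int_cols = toy.select_dtypes(include=[np.integer]).columns.tolist()

print(toy.dtypes)
print(float_cols, int_cols)
```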
Ans 8: False: there are no missing values in the Pima dataframe. This is consistent with the non-null counts reported by pima.info() in Q7.
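A sketch of a missing-value check, assuming the question used something like `isnull().values.any()`. The dataframe below is hypothetical and deliberately contains one NaN, so the check returns True here, whereas it returned False on the Pima data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one deliberately missing Glucose value.
toy = pd.DataFrame({
    "Glucose": [110.0, np.nan, 170.0],
    "BMI": [30.1, 24.5, 38.2],
})

# True if any cell in the dataframe is NaN.
has_missing = bool(toy.isnull().values.any())

# Number of NaN cells in each column.
missing_per_column = toy.isnull().sum()

print(has_missing)
print(missing_per_column)
```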
Ans 9: The summary statistics of the data show the count, mean, and standard deviation of each of the columns (besides Outcome). They also give the minimum and maximum as well as the 25th, 50th and 75th percentiles.
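For reference, a small sketch of what `describe()` returns, using made-up values (the `toy` dataframe is hypothetical).

```python
import pandas as pd

# Hypothetical values, chosen so the mean is easy to check by hand.
toy = pd.DataFrame({
    "Glucose": [80, 100, 120, 140],
    "BMI": [25.0, 30.0, 35.0, 40.0],
})

# One column per numeric variable, one row per statistic.
summary = toy.describe()

print(summary)
print(summary.loc["mean", "Glucose"])  # 110.0
```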
Ans 10: The distribution plot shows that BloodPressure exhibits behaviour similar to a normal distribution.
Ans 11: The BMI of the person having the highest Glucose is 42.9.
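A lookup like this can be sketched with `idxmax`, shown here on hypothetical values. Note that `idxmax` returns only the first row if several rows share the maximum.

```python
import pandas as pd

# Hypothetical stand-in for the Pima dataframe.
toy = pd.DataFrame({"Glucose": [110, 95, 170], "BMI": [30.1, 24.5, 38.2]})

# Index label of the row with the highest Glucose value.
row_label = toy["Glucose"].idxmax()

# BMI recorded in that same row.
bmi_at_max_glucose = toy.loc[row_label, "BMI"]

print(bmi_at_max_glucose)  # 38.2
```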
Ans 12: The mean, median and mode of BMI are 32.45080515543617, 32.0 and 32.0, respectively. These three measures of central tendency are very close, with the median and mode being identical.
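A short sketch of the three measures on a made-up BMI series (the values are hypothetical). `mode()` returns a Series because a column can have more than one mode.

```python
import pandas as pd

# Hypothetical BMI values; 32.0 appears twice.
bmi = pd.Series([30.0, 32.0, 32.0, 35.0, 41.0], name="BMI")

mean_bmi = bmi.mean()      # (30 + 32 + 32 + 35 + 41) / 5 = 34.0
median_bmi = bmi.median()  # middle value of the sorted series = 32.0
mode_bmi = bmi.mode()      # Series of most frequent values, here [32.0]

print(mean_bmi, median_bmi, mode_bmi.tolist())
```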
Ans 13: There are 343 women with Glucose levels above the mean Glucose of 121.675781.
Ans 14: There are 22 women with Blood Pressure that is equal to its median and BMI that is less than its median.
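The counts in Ans 13 and Ans 14 come from boolean filtering. A minimal sketch on hypothetical data follows. Each condition needs its own parentheses when combined with `&`.

```python
import pandas as pd

# Hypothetical stand-in for the Pima dataframe.
toy = pd.DataFrame({
    "Glucose": [90, 100, 150, 160],
    "BloodPressure": [70, 72, 72, 80],
    "BMI": [25.0, 28.0, 31.0, 36.0],
})

# Rows with Glucose above the column mean (mean is 125 here).
n_above_mean = int((toy["Glucose"] > toy["Glucose"].mean()).sum())

# Rows with BloodPressure equal to its median (72.0)
# and BMI below its median (29.5).
mask = (toy["BloodPressure"] == toy["BloodPressure"].median()) & (
    toy["BMI"] < toy["BMI"].median()
)
n_both = int(mask.sum())

print(n_above_mean, n_both)  # 2 1
```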
Ans 15: From the pairplot, the most apparent observation is that the higher the glucose level, the more likely the woman is to be diabetic. This is visible in the top row of the pairplot, where there are more orange dots as glucose increases on the y-axis. Neither skin thickness nor diabetes pedigree function shows much relationship with the outcome.
Ans 16: This scatterplot shows that insulin tends to increase as glucose levels increase. However, the correlation is weak, as a large number of women across the range of glucose levels have exactly the same insulin level, as shown by the "horizontal line" formed by the dots.
Ans 17: Yes, there are outliers, as denoted by the dots above age 65.
Ans 18: The two histograms show that the number of women in both groups, those with and those without diabetes, decreases with age. This could be attributed to the fact that people are more likely to die as they get older. However, women without diabetes are heavily concentrated in their 20s and early 30s, which shows as a strong positive skew on the graph, and their count drops sharply after the early 30s. The count of women with diabetes decreases with age more gradually, in a roughly linear fashion.
Ans 19: The interquartile range shows the spread of the middle 50% of the data. This allows data scientists to see how spread out the core of their data is. The boxplot in Q17 visualizes this for Age.
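A sketch of how the interquartile range can be computed, and how it relates to the outlier dots on a boxplot. The Age values are hypothetical. The 1.5 * IQR fence is the default whisker rule in Matplotlib and Seaborn boxplots and is assumed here.

```python
import pandas as pd

# Hypothetical Age values with one unusually high entry.
age = pd.Series([21, 25, 29, 33, 81], name="Age")

q1 = age.quantile(0.25)  # 25.0
q3 = age.quantile(0.75)  # 33.0
iqr = q3 - q1            # spread of the middle 50% = 8.0

# Points beyond the upper fence would be drawn as outlier dots.
upper_fence = q3 + 1.5 * iqr  # 45.0
outliers = age[age > upper_fence]

print(iqr, upper_fence, outliers.tolist())  # 8.0 45.0 [81]
```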
Ans 20: The correlation matrix and the heatmap built from it allow the pairwise correlations between variables to be compared visually. For example, Age and Pregnancies have a correlation coefficient of about 0.54, a moderate positive linear relationship: older women tend to have had more pregnancies. The coefficient measures the strength of the linear relationship and does not mean that pregnancies increase with age "half the time". The relationship makes sense, as most women choose to have children in the middle of their life and not when they are very young or very old. BMI and skin thickness are correlated to a similar degree. Blood pressure shows a slight correlation with BMI and with Age (0.33 and 0.28). The black squares mark pairs with a correlation close to zero, notably blood pressure and insulin.
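A minimal sketch of the correlation matrix behind the heatmap, on hypothetical data in which Age and Pregnancies are almost perfectly linear (much stronger than in the Pima data). Squaring a coefficient gives the share of variance explained by a linear fit, so a coefficient of 0.54 corresponds to roughly 29%.

```python
import pandas as pd

# Hypothetical stand-in for the Pima dataframe.
toy = pd.DataFrame({
    "Age": [21, 25, 30, 40, 50],
    "Pregnancies": [0, 1, 2, 4, 5],
    "BloodPressure": [70, 64, 80, 72, 66],
})

# Pairwise Pearson correlation coefficients, each in [-1, 1].
corr = toy.corr()

r = corr.loc["Age", "Pregnancies"]
r_squared = r ** 2  # share of variance explained by a linear fit

print(corr.round(2))
print(round(r, 2), round(r_squared, 2))
```

The matrix can then be passed to `sns.heatmap(corr, annot=True)` to draw the annotated heatmap.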