Mobile Internet Case Study¶



Context¶


ExperienceMyServices reported that a typical American spends an average of 144 minutes (2.4 hours) per day accessing the Internet via a mobile device with a standard deviation of 110 minutes.

To test this claim, you collected a sample of 30 observations from friends and family. The time each person spends per day accessing the Internet via a mobile device (in minutes) is stored in "InternetMobileTime.csv".


Key Question¶


Is there enough statistical evidence to conclude that the population mean time spent per day accessing the Internet via a mobile device is different from 144 minutes? Use the p-value approach and a level of significance of 0.05.

Note: We can assume that the samples are randomly selected, independent, and come from a normally distributed population.

Importing the necessary libraries¶

In [1]:
# Import the important packages
import pandas as pd  # Library used for data manipulation and analysis

import numpy as np  # Library used for working with arrays

import matplotlib.pyplot as plt  # Library for visualization

import seaborn as sns  # Library for visualization

%matplotlib inline

import scipy.stats as stats  # This library contains a large number of probability distributions as well as a growing library of statistical functions

Loading the Data¶

In [2]:
mydata = pd.read_csv('InternetMobileTime.csv')

mydata.head()
Out[2]:
   Minutes
0       72
1      144
2       48
3       72
4       36
In [3]:
mydata.shape
Out[3]:
(30, 1)

Steps of Hypothesis Testing¶

Step 1: Define the null and the alternative hypotheses¶

The null hypothesis states that the mean Internet usage time, $\mu$, is equal to 144 minutes. The alternative hypothesis states that the mean Internet usage time, $\mu$, is not equal to 144 minutes.

  • $H_0$: $\mu$ = 144
  • $H_a$: $\mu$ $\neq$ 144

Step 2: Decide the significance level¶

Here, we are given that $\alpha$ = 0.05.

In [4]:
print("The sample size for this problem is", len(mydata))
The sample size for this problem is 30

Step 3: Identify the test statistic¶

The population is assumed to be normally distributed, and the population standard deviation is taken to be known ($\sigma$ = 110 minutes), so we can use the one-sample z-test statistic.
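
As a sketch of what the next step computes, the one-sample z-statistic can be written as follows, where $\bar{x}$ is the sample mean, $\mu_0$ = 144 is the hypothesized mean, $\sigma$ = 110 is the assumed population standard deviation, and $n$ = 30 is the sample size (the symbols $\bar{x}$ and $\mu_0$ are notation introduced here for illustration):

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

With these values, the standard error in the denominator is $\sigma / \sqrt{n} = 110 / \sqrt{30} \approx 20.08$ minutes.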

Step 4: Calculate the p-value using z-statistic¶

In [6]:
sample_mean = mydata["Minutes"].mean()
In [7]:
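# Calculating the z-statistic

n = len(mydata)  # Sample size (30)
mu = 144  # Hypothesized population mean (minutes)
sigma = 110  # Known population standard deviation (minutes)

# z = (sample mean - hypothesized mean) / standard error
test_stat = (sample_mean - mu) / (sigma / np.sqrt(n))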
In [8]:
test_stat
Out[8]:
1.8157832663959144
In [9]:
from scipy.stats import norm

# One-tailed p-value: area in the tail beyond |z|
# (abs() keeps this correct when the test statistic is negative)
p_value1 = 1 - norm.cdf(abs(test_stat))

# The two-tailed p-value is twice the one-tailed p-value
p_value_ztest = p_value1 * 2
In [10]:
print('The p-value is: {0} '.format(p_value_ztest))
The p-value is: 0.06940362517785204 
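
The same calculation can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the original notebook: the function names `two_tailed_p_value` and `one_sample_z_test` and the four-value sample are hypothetical, and it uses Python's standard-library `statistics.NormalDist` in place of `scipy.stats.norm`. Taking the absolute value of the statistic keeps the two-tailed p-value valid whether the sample mean falls above or below the hypothesized mean.

```python
from math import sqrt
from statistics import NormalDist, mean


def two_tailed_p_value(z):
    """Two-tailed p-value for a z-statistic under the standard normal."""
    # Area in one tail beyond |z|, doubled to cover both tails
    return 2 * (1 - NormalDist().cdf(abs(z)))


def one_sample_z_test(sample, mu0, sigma):
    """Return (z, two-tailed p-value) for a one-sample z-test with known sigma."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    return z, two_tailed_p_value(z)


# Hypothetical data for illustration only (not the values in the CSV)
example_sample = [150, 160, 170, 180]
z, p = one_sample_z_test(example_sample, mu0=144, sigma=110)
print("z = {:.4f}, p-value = {:.4f}".format(z, p))

# Applying the helper to the z-statistic reported above
print(two_tailed_p_value(1.8157832663959144))
```

The last line should reproduce the p-value of about 0.0694 obtained above, and passing the negated statistic gives the same result.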

Step 5: Decide to reject or fail to reject the null hypothesis based on the z-statistic¶

In [11]:
alpha_value = 0.05 # Level of significance

print('Level of significance: %.2f' %alpha_value)

if p_value_ztest < alpha_value:
    print('We have sufficient evidence to reject the null hypothesis as the p-value is less than the level of significance')
else:
    print('We do not have sufficient evidence to reject the null hypothesis as the p-value is greater than the level of significance')
Level of significance: 0.05
We do not have sufficient evidence to reject the null hypothesis as the p-value is greater than the level of significance

The z-test assumes that the population standard deviation is known, which is rarely true in practice. When it is unknown, we use the one-sample t-test instead: the t-statistic has the same form as the z-statistic, but it replaces the population standard deviation with the sample standard deviation and is compared against a t-distribution with n - 1 degrees of freedom.

We will use scipy.stats.ttest_1samp, which performs the t-test for the mean of one sample given the sample observations. It returns the t-statistic and the p-value for a two-tailed t-test.
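
To make the relationship to the z-statistic concrete, here is a minimal sketch of the t-statistic computed by hand. The function name `one_sample_t_statistic` and the eight-value sample are hypothetical and chosen only so the arithmetic is easy to follow; they are not the values in the CSV. The sketch uses the sample standard deviation with an n - 1 denominator (`statistics.stdev`), and it stops at the statistic and the degrees of freedom because the p-value requires the t-distribution, which `scipy.stats.ttest_1samp` handles.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t_statistic(sample, mu0):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    sample_std = stdev(sample)  # Sample standard deviation (n - 1 denominator)
    standard_error = sample_std / sqrt(n)
    t = (mean(sample) - mu0) / standard_error
    return t, n - 1


# Hypothetical data for illustration only
example_sample = [2, 4, 4, 4, 5, 5, 7, 9]
t_stat, dof = one_sample_t_statistic(example_sample, mu0=4)
print("t = {:.4f}, degrees of freedom = {}".format(t_stat, dof))
```

For this sample the mean is 5 and the sum of squared deviations is 32, so the statistic works out to $\sqrt{7/4} \approx 1.3229$ with 7 degrees of freedom.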

Step 6: Calculate the p-value using t-statistic¶

In [11]:
# Pass the "Minutes" column (a Series) so that scalars are returned instead of one-element arrays
t_statistic, p_value_ttest = stats.ttest_1samp(mydata["Minutes"], popmean = 144)
print('One sample t-test \nt statistic: {0:.8f} p value: {1:.8f} '.format(t_statistic, p_value_ttest))
One sample t-test 
t statistic: 1.41131966 p value: 0.16878961 

Step 7: Decide to reject or fail to reject the null hypothesis based on the t-statistic¶

In [12]:
alpha_value = 0.05 # Level of significance

print('Level of significance: %.2f' %alpha_value)

if p_value_ttest < alpha_value:
    print('We have sufficient evidence to reject the null hypothesis as the p-value is less than the level of significance')
else:
    print('We do not have sufficient evidence to reject the null hypothesis as the p-value is greater than the level of significance')
Level of significance: 0.05
We do not have sufficient evidence to reject the null hypothesis as the p-value is greater than the level of significance

Observation:

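  • At a 5% significance level, we do not have enough statistical evidence to conclude that the mean time spent per day accessing the Internet via a mobile device differs from 144 minutes. The z-test (p-value ≈ 0.069) and the t-test (p-value ≈ 0.169) lead to the same decision.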