Uber Data Analysis¶



Context¶


Uber Technologies, Inc. is an American multinational transportation network company based in San Francisco, with operations in approximately 72 countries and 10,500 cities. In the fourth quarter of 2021, Uber had 118 million monthly active users worldwide and generated an average of 19 million trips per day.

Ridesharing is a very volatile market and demand fluctuates wildly with time, place, weather, local events, etc. The key to being successful in this business is to be able to detect patterns in these fluctuations and cater to the demand at any given time.

As a newly hired Data Scientist in Uber's New York office, you have been given the task of extracting insights from data that will help the business better understand its demand profile and take appropriate actions to drive better outcomes. Your goal is to identify good insights that are potentially actionable, i.e., insights the business can act on.


Objective¶


To extract actionable insights around demand patterns across various factors.


Key Questions¶


  1. What are the different variables that influence pickups?
  2. Which factor affects the pickups the most? What could be plausible reasons for that?
  3. What are your recommendations to Uber management to capitalize on fluctuating demand?

Dataset Description¶


The data contains information about the weather, location, and pickups.

  • pickup_dt: Date and time of the pick-up
  • borough: NYC's borough
  • pickups: Number of pickups for the period (1 hour)
  • spd: Wind speed in miles/hour
  • vsb: Visibility in miles to the nearest tenth
  • temp: Temperature in Fahrenheit
  • dewp: Dew point in Fahrenheit
  • slp: Sea level pressure
  • pcp01: 1-hour liquid precipitation
  • pcp06: 6-hour liquid precipitation
  • pcp24: 24-hour liquid precipitation
  • sd: Snow depth in inches
  • hday: Whether the day is a holiday (Y) or not (N)

Importing the necessary libraries and overview of the dataset¶

In [1]:
# Library to suppress warnings

import warnings
warnings.filterwarnings('ignore')
In [2]:
# Libraries to help with reading and manipulating data

import pandas as pd

import numpy as np

# Libraries to help with data visualization

import matplotlib.pyplot as plt

import seaborn as sns
%matplotlib inline

# Library to extract datetime features
import datetime as dt

Loading the dataset¶

In [3]:
data = pd.read_csv('Uber.csv')
In [4]:
# Copying data to another variable to avoid any changes to the original data
df = data.copy()

View the first 5 rows of the dataset¶

In [5]:
# Looking at head (the first 5 observations) 
df.head()
Out[5]:
pickup_dt borough pickups spd vsb temp dewp slp pcp01 pcp06 pcp24 sd hday
0 2015-01-01 01:00:00 Bronx 152 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0 0.0 Y
1 2015-01-01 01:00:00 Brooklyn 1519 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0 0.0 Y
2 2015-01-01 01:00:00 EWR 0 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0 0.0 Y
3 2015-01-01 01:00:00 Manhattan 5258 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0 0.0 Y
4 2015-01-01 01:00:00 Queens 405 5.0 10.0 30.0 7.0 1023.5 0.0 0.0 0.0 0.0 Y

Observations:

  • The column pickup_dt includes the pickup date and time. The date shows that the data starts from 01-Jan-2015.
  • The column borough contains the name of the New York borough in which the pickup was made.
  • The column pickups contains the number of pickups in the borough at the given time.
  • All of the weather variables are numerical.
  • The variable hday (holiday) is a categorical variable.

View the last 5 rows of the dataset¶

In [6]:
# Looking at tail (the last 5 observations) 
df.tail()
Out[6]:
pickup_dt borough pickups spd vsb temp dewp slp pcp01 pcp06 pcp24 sd hday
29096 2015-06-30 23:00:00 EWR 0 7.0 10.0 75.0 65.0 1011.8 0.0 0.0 0.0 0.0 N
29097 2015-06-30 23:00:00 Manhattan 3828 7.0 10.0 75.0 65.0 1011.8 0.0 0.0 0.0 0.0 N
29098 2015-06-30 23:00:00 Queens 580 7.0 10.0 75.0 65.0 1011.8 0.0 0.0 0.0 0.0 N
29099 2015-06-30 23:00:00 Staten Island 0 7.0 10.0 75.0 65.0 1011.8 0.0 0.0 0.0 0.0 N
29100 2015-06-30 23:00:00 NaN 3 7.0 10.0 75.0 65.0 1011.8 0.0 0.0 0.0 0.0 N

Observations:

  • The head indicated that the data began on January 1, 2015, whereas the tail indicates that it continued until June 30, 2015. This means we have six months' worth of data to analyze.

Checking the shape of the dataset¶

In [7]:
df.shape
Out[7]:
(29101, 13)
  • The dataset has 29,101 rows and 13 columns.

Checking the info()¶

In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pickup_dt  29101 non-null  object 
 1   borough    26058 non-null  object 
 2   pickups    29101 non-null  int64  
 3   spd        29101 non-null  float64
 4   vsb        29101 non-null  float64
 5   temp       29101 non-null  float64
 6   dewp       29101 non-null  float64
 7   slp        29101 non-null  float64
 8   pcp01      29101 non-null  float64
 9   pcp06      29101 non-null  float64
 10  pcp24      29101 non-null  float64
 11  sd         29101 non-null  float64
 12  hday       29101 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 2.9+ MB

Observations:

  • All columns have 29,101 observations except borough, which has 26,058 non-null observations, indicating that it contains missing values.
  • pickup_dt is read as an 'object' data type, but it should have the data type as DateTime.
  • borough and hday (holiday) should be categorical variables.

Summary of the data¶

In [9]:
df.describe().T
Out[9]:
count mean std min 25% 50% 75% max
pickups 29101.0 490.215903 995.649536 0.0 1.0 54.0 449.000000 7883.00
spd 29101.0 5.984924 3.699007 0.0 3.0 6.0 8.000000 21.00
vsb 29101.0 8.818125 2.442897 0.0 9.1 10.0 10.000000 10.00
temp 29101.0 47.669042 19.814969 2.0 32.0 46.0 64.500000 89.00
dewp 29101.0 30.823065 21.283444 -16.0 14.0 30.0 50.000000 73.00
slp 29101.0 1017.817938 7.768796 991.4 1012.5 1018.2 1022.900000 1043.40
pcp01 29101.0 0.003830 0.018933 0.0 0.0 0.0 0.000000 0.28
pcp06 29101.0 0.026129 0.093125 0.0 0.0 0.0 0.000000 1.24
pcp24 29101.0 0.090464 0.219402 0.0 0.0 0.0 0.050000 2.10
sd 29101.0 2.529169 4.520325 0.0 0.0 0.0 2.958333 19.00
  • There is a significant discrepancy between the third quartile and the highest value for the number of pickups (pickups) and the snow depth (sd), indicating that these variables may have outliers to the right.
  • The temperature has a broad range, showing that the data includes records from the winter as well as summer seasons.

By default, the describe() function shows the summary of numeric variables only. Let's check the summary of non-numeric variables.

In [10]:
df.describe(exclude = 'number').T
Out[10]:
count unique top freq
pickup_dt 29101 4343 2015-01-01 01:00:00 7
borough 26058 6 Bronx 4343
hday 29101 2 N 27980

Observations:

  • The variable 'borough' has six unique categories. The category Bronx has occurred 4,343 times in the data.
  • The variable 'hday' has 2 unique categories. The category N, i.e., not a holiday, has occurred far more often, which makes sense.

Let's check the count of each unique category in each of the categorical variables.

In [11]:
# Making a list of all categorical variables 
cat_col = ['borough', 'hday']

# Printing number of count of each unique value in each column

for column in cat_col:
    print(df[column].value_counts())
    
    print('-' * 50)
Bronx            4343
Brooklyn         4343
EWR              4343
Manhattan        4343
Queens           4343
Staten Island    4343
Name: borough, dtype: int64
--------------------------------------------------
N    27980
Y     1121
Name: hday, dtype: int64
--------------------------------------------------
  • The above output shows that the borough variable has an equal count for each category.

Extracting date parts from pickup date¶

In [12]:
# Converting pickup_dt datatype to datetime 

df.pickup_dt = pd.to_datetime(df.pickup_dt)

# Extracting date parts from pickup_dt

df['start_year'] = df.pickup_dt.dt.year

df['start_month'] = df.pickup_dt.dt.month_name()

df['start_hour'] = df.pickup_dt.dt.hour

df['start_day'] = df.pickup_dt.dt.day

df['week_day'] = df.pickup_dt.dt.day_name()
In [13]:
# Removing pickup_dt column as it will not be required for further analysis

df.drop('pickup_dt', axis = 1, inplace = True)
In [14]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   borough      26058 non-null  object 
 1   pickups      29101 non-null  int64  
 2   spd          29101 non-null  float64
 3   vsb          29101 non-null  float64
 4   temp         29101 non-null  float64
 5   dewp         29101 non-null  float64
 6   slp          29101 non-null  float64
 7   pcp01        29101 non-null  float64
 8   pcp06        29101 non-null  float64
 9   pcp24        29101 non-null  float64
 10  sd           29101 non-null  float64
 11  hday         29101 non-null  object 
 12  start_year   29101 non-null  int64  
 13  start_month  29101 non-null  object 
 14  start_hour   29101 non-null  int64  
 15  start_day    29101 non-null  int64  
 16  week_day     29101 non-null  object 
dtypes: float64(9), int64(4), object(4)
memory usage: 3.8+ MB

Missing value treatment¶

In [15]:
# Checking missing values

df.isna().sum()
Out[15]:
borough        3043
pickups           0
spd               0
vsb               0
temp              0
dewp              0
slp               0
pcp01             0
pcp06             0
pcp24             0
sd                0
hday              0
start_year        0
start_month       0
start_hour        0
start_day         0
week_day          0
dtype: int64
  • There are 3,043 missing values for the variable borough.
  • Other variables have no missing values.
In [16]:
# Checking the missing values further

df.borough.value_counts(normalize = True, dropna = False)
Out[16]:
Bronx            0.149239
Brooklyn         0.149239
EWR              0.149239
Manhattan        0.149239
Queens           0.149239
Staten Island    0.149239
NaN              0.104567
Name: borough, dtype: float64
  • All six borough categories have the same percentage, i.e., ~15% each, so there is no single mode for this variable.
  • The percentage of missing values is close to the percentage of observations from other boroughs.
  • We can treat the missing values as a separate category for this variable.
In [17]:
# Replacing NaN with Unknown

df['borough'].fillna('Unknown', inplace = True) 
In [18]:
df.borough.value_counts()
Out[18]:
Bronx            4343
Brooklyn         4343
EWR              4343
Manhattan        4343
Queens           4343
Staten Island    4343
Unknown          3043
Name: borough, dtype: int64
In [19]:
df.isnull().sum()
Out[19]:
borough        0
pickups        0
spd            0
vsb            0
temp           0
dewp           0
slp            0
pcp01          0
pcp06          0
pcp24          0
sd             0
hday           0
start_year     0
start_month    0
start_hour     0
start_day      0
week_day       0
dtype: int64
  • Now, there are no missing values in the data.

Exploratory Data Analysis: Univariate¶

Let us explore the numerical variables first.

In [20]:
# While doing a univariate analysis of numerical variables, we want to study their central tendency and dispersion

# Let us write a function that will help us create a boxplot and histogram for any numerical variable

# This function takes the numerical variable as the input and returns the boxplots and histograms for that variable

# This would help us write faster and cleaner code

def histogram_boxplot(feature, figsize = (15, 10), bins = None):
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15, 10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows = 2,     # Number of rows of the subplot grid
                                           sharex = True, # The X-axis will be shared among all the subplots
                                           gridspec_kw = {"height_ratios": (.25, .75)}, 
                                           figsize = figsize 
                                           ) 
    # Creating the subplots
    
    # Boxplot will be created and the mean value of the column will be indicated by a marker
    sns.boxplot(feature, ax = ax_box2, showmeans = True, color = 'red')
    
    # Histogram (bins = None lets seaborn pick the number of bins automatically)
    sns.distplot(feature, kde = False, ax = ax_hist2, bins = bins)
    
    ax_hist2.axvline(np.mean(feature), color = 'g', linestyle = '--')      # Add mean to the histogram
    
    ax_hist2.axvline(np.median(feature), color = 'black', linestyle = '-') # Add median to the histogram

Observations on Pickups¶

In [21]:
histogram_boxplot(df.pickups)

Observations:

  • The distribution of hourly pickups is highly right-skewed.
  • The majority of the hourly pickups are close to 0.
  • The median number of pickups is 54, while the mean is ~490, which reflects the strong right skew.
  • There are a lot of outliers in this variable (quantified in the sketch below).
  • While most hourly pickups are at the lower end, there are observations where hourly pickups went as high as ~8,000.
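To put numbers on the skew and outlier observations above, here is a quick sketch (assuming df is the prepared DataFrame from the steps above) that computes the skewness of pickups and counts hours beyond the usual 1.5*IQR boxplot fence:

In [ ]:
q1, q3 = df['pickups'].quantile([0.25, 0.75])   # First and third quartiles
upper_fence = q3 + 1.5 * (q3 - q1)              # Standard boxplot upper fence

print('Skewness:', round(df['pickups'].skew(), 2))
print('Hours above the upper fence:', (df['pickups'] > upper_fence).sum())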

Observations on Visibility¶

In [22]:
histogram_boxplot(df.vsb)

Observations:

  • The distribution of 'visibility' is left-skewed.
  • Both the mean and the median are high, indicating that visibility is good most of the time.
  • There are, however, outliers towards the left, indicating that visibility is extremely low on some days.
  • It will be interesting to see how visibility affects the Uber pickup frequency.

Observations on Temperature¶

In [23]:
histogram_boxplot(df.temp)

Observations:

  • Temperature does not have any outliers.
  • The distribution of temperature is bimodal, with one peak at around 35°F and another at around 60°F. The peak at 35°F (~1.7°C) is higher, indicating that cold-weather records dominate.

Observations on Dew Point¶

In [24]:
histogram_boxplot(df.dewp)

Observations:

  • There are no outliers for dew point either.
  • The distribution is similar to that of temperature, suggesting a possible correlation between the two variables (a quick numeric check follows).
  • Dew point is an indication of humidity, which is correlated with temperature.
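As a quick numeric check of the suspected relationship, a one-line sketch (the full correlation heatmap later in the analysis covers all variables):

In [ ]:
# Pearson correlation between temperature and dew point
df['temp'].corr(df['dewp'])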

Observations on Sea Level Pressure¶

In [25]:
histogram_boxplot(df.slp)

Observations:

  • Sea level pressure distribution looks approximately normal.
  • There are a few outliers on both ends.

Observations on Snow Depth¶

In [26]:
histogram_boxplot(df.sd)

Observations:

  • There was snowfall during the period that we are analyzing.
  • There are outliers in this variable.
  • We will have to see how snowfall affects pickups. Very few people are likely to go out when it is snowing heavily, so pickups would likely decrease when it snows.

Now, let's explore the categorical variables.

In [27]:
# Function to create barplots that indicates percentage for each category

def bar_perc(data, z):
    
    total = len(data[z]) # Length of the column
    
    plt.figure(figsize = (15, 5))
    
    # plt.xticks(rotation = 45)
    
    ax = sns.countplot(data[z], palette = 'Paired')
    
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total) # Percentage of each class
        
        x = p.get_x() + p.get_width() / 2 - 0.05                    # X-coordinate of the annotation (bar center)
        
        y = p.get_y() + p.get_height()                              # Y-coordinate of the annotation (top of the bar)
        
        ax.annotate(percentage, (x, y), size = 12)                  # Annotate the percentage 
        
    plt.show()                                                      # Display the plot

Observations on holiday¶

In [28]:
bar_perc(df, 'hday')

Observation:

  • Only 3.9% of the observations fall on holidays in the period that we are analyzing.

Observations on borough¶

In [29]:
bar_perc(df, 'borough')

Observation:

  • The observations are uniformly distributed across the boroughs except for the observations that had NaN values and were attributed to the Unknown borough.

Exploratory Data Analysis: Multivariate¶

Let's plot multivariate charts between variables to understand their interaction with each other.

Correlation¶

In [30]:
# Check for correlation among numerical variables

num_var = ['pickups', 'spd', 'vsb', 'temp', 'dewp', 'slp', 'pcp01', 'pcp06', 'pcp24', 'sd']

corr = df[num_var].corr()

# Plot the heatmap

plt.figure(figsize = (14, 10))

sns.heatmap(corr, annot = True, cmap = 'coolwarm',
            
        fmt = ".1f",
            
        xticklabels = corr.columns,
            
        yticklabels = corr.columns)
Out[30]:
<AxesSubplot:>

Observations:

  • As expected, temperature shows a high correlation with dew point.
  • Visibility is negatively correlated with precipitation. If rainfall is high during the hour, visibility is low, which is aligned with our intuitive understanding.
  • Snow depth, as one would expect, is negatively correlated with temperature.
  • Wind speed and sea level pressure are negatively correlated with temperature.
  • It is important to note that correlation does not imply causation.
  • There does not seem to be a strong relationship between the number of pickups and the weather variables (quantified in the sketch below).
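To back the last observation with numbers, a small sketch that ranks the variables by the absolute strength of their linear correlation with pickups, reusing the corr matrix computed above:

In [ ]:
# Absolute correlation of each variable with pickups, strongest first
corr['pickups'].drop('pickups').abs().sort_values(ascending = False)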

Pair Plot¶

In [31]:
sns.pairplot(df[num_var], corner = True)

plt.show()

Observations:

  • The pair plot echoes the insights from the correlation heatmap.
  • As observed earlier, there does not seem to be a strong relationship between the number of pickups and weather stats.

Relationship between pickups and time based variables¶

Pickups across Months¶

In [32]:
cats = df.start_month.unique().tolist() # The data is in chronological order, so the months come out January through June

df.start_month = pd.Categorical(df.start_month, ordered = True, categories = cats)

plt.figure(figsize = (20, 7))

sns.lineplot(x = "start_month", y = "pickups", data = df, ci = 0, color = "RED", estimator = 'sum')

plt.ylabel('Total pickups')

plt.xlabel('Month')

plt.show()

Observations:

  • There is a clear increasing trend in monthly bookings.
  • Bookings in June are almost 1.5 times those in January (the exact monthly totals can be checked with the sketch below).
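A one-line sketch to pull the exact monthly totals behind the 1.5x claim, using the columns created earlier:

In [ ]:
# Total pickups per month; start_month is an ordered categorical, so the
# output follows calendar order
df.groupby('start_month')['pickups'].sum()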

Pickups vs Days of the Month¶

In [33]:
plt.figure(figsize = (20, 7))

sns.lineplot(x = "start_day", y = "pickups", estimator = 'sum', ci = 0, data = df, color = "RED")

plt.ylabel('Total pickups')

plt.xlabel('Day of Month')

plt.show()

Observations:

  • There is a steep fall in the bookings on the last day of the month.
  • This can partially be attributed to Feb having just 28 days. We can drop Feb and have a look at this chart again.
  • There is a peak in the bookings around the 20th day of the month.
In [34]:
# Let's drop the Feb month and visualize again

df_not_feb =  df[df['start_month'] != 'February']

plt.figure(figsize = (20, 7))

sns.lineplot(x = "start_day", y = "pickups", estimator = 'sum', ci = 0, data = df_not_feb, color = "RED")

plt.ylabel('Total pickups')

plt.xlabel('Day of Month')

plt.show()

Observations:

  • As expected, with February removed, the dip around days 29-30 largely disappears.
  • The number of pickups for day 31 is still low because not all months have a 31st day.

Pickups across Weekdays¶

In [35]:
cats = ['Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday']

df.week_day = pd.Categorical(df.week_day, ordered = True, categories = cats)

plt.figure(figsize = (20, 7))

sns.lineplot(x = "week_day", y = "pickups", ci = 0, data = df, color = "RED")

plt.ylabel('Mean pickups')

plt.xlabel('Day of Week')

plt.show()

Observations:

  • Pickups gradually increase as the week progresses and start dropping after Saturday.
  • We need to do more investigation to understand why the demand for Uber is low at the beginning of the week; the day-wise means in the sketch below make the early-week dip explicit.
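A small sketch to tabulate the day-of-week effect numerically, using the week_day column created earlier:

In [ ]:
# Mean hourly pickups by day of week (week_day is an ordered categorical)
df.groupby('week_day')['pickups'].mean().round(1)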

Pickups across Boroughs¶

In [36]:
plt.figure(figsize = (20, 10))  

sns.boxplot(df['borough'], df['pickups'])

plt.ylabel('pickups')

plt.xlabel('Borough')

plt.show()

Observations:

  • There is a clear difference in ridership across the different boroughs.
  • Manhattan has the highest number of bookings.
  • Brooklyn and Queens are distant followers.
  • EWR, Unknown, and Staten Island have a very low number of bookings. The demand there is so small that it can probably be covered by drop-offs from inbound trips from other areas.

Relationship between Pickups and Holidays¶

In [37]:
df.groupby('hday')['pickups'].mean()
Out[37]:
hday
N    492.339957
Y    437.199822
Name: pickups, dtype: float64
In [38]:
# Check if the trend is similar across boroughs

df.groupby(by = ['borough','hday'])['pickups'].mean()
Out[38]:
borough        hday
Bronx          N         50.771073
               Y         48.065868
Brooklyn       N        534.727969
               Y        527.011976
EWR            N          0.023467
               Y          0.041916
Manhattan      N       2401.302921
               Y       2035.928144
Queens         N        308.899904
               Y        320.730539
Staten Island  N          1.606082
               Y          1.497006
Unknown        N          2.057456
               Y          2.050420
Name: pickups, dtype: float64

Observations:

  1. Overall, mean pickups on holidays are lower than on non-holidays.
  2. Except for Manhattan, mean pickups on holidays are quite similar to non-holiday pickups (a rough check of the Manhattan gap follows this list).
  3. In Queens, mean pickups on holidays are higher.
  4. There are hardly any pickups in EWR.
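As a rough check of whether the Manhattan holiday gap is more than noise, a sketch using SciPy's two-sample t-test (SciPy is assumed to be available; this is indicative only, since hourly pickups are far from independent observations):

In [ ]:
from scipy import stats

man = df[df['borough'] == 'Manhattan']
holiday = man.loc[man['hday'] == 'Y', 'pickups']
non_holiday = man.loc[man['hday'] == 'N', 'pickups']

# Welch's t-test (does not assume equal variances)
t_stat, p_val = stats.ttest_ind(holiday, non_holiday, equal_var = False)
print(f't = {t_stat:.2f}, p = {p_val:.4f}')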

Relationship between Pickups and Hour of the day across Boroughs¶

In [39]:
plt.figure(figsize = (20, 7))

sns.lineplot(x = "start_hour", y = "pickups", ci = 0, data = df, hue = 'borough')

plt.ylabel('Pickups')

plt.xlabel('Hour of the day')

plt.show()

Observations:

  • Bookings peak around the 19th and 20th hours of the day (7-8 PM) and then decrease until 5 AM.
  • The peak can be attributed to the time people leave their workplaces.
  • From 5 AM onwards, there is an increasing trend until 10 AM, possibly the office rush.
  • Pickups dip slightly from 10 AM to 12 PM, after which they start increasing again.
  • The number of pickups in Manhattan is very high and dominates the spread across boroughs.
  • We cannot observe the distribution for the EWR and Staten Island boroughs in this plot due to the very low counts there. Let's try converting the pickups to a logarithmic scale to visualize all the boroughs.
In [40]:
plt.figure(figsize = (20, 7))

sns.lineplot(x = df.start_hour, y = np.log1p(df.pickups), estimator ='sum', ci = 0, hue = df.borough)

plt.ylabel('Total pickups')

plt.xlabel('Hour of the day')

plt.legend(bbox_to_anchor = (1, 1))

plt.show()

Observations:

  • The hourly pattern can be seen in almost all the boroughs.
  • After applying the logarithmic scale, it is clear that the four major boroughs follow the same pattern.
  • EWR seems to have random demand, with most values being zero and a few 1s and 2s.
  • Manhattan sees the most Uber pickups. Let us explore this borough in more detail.

Manhattan Pickups Heatmap - Weekday vs Hour¶

In [41]:
df_man = df[df.borough == 'Manhattan']

df_hm = df_man.pivot_table(index = 'start_hour', columns = 'week_day', values = 'pickups')

# Draw a heatmap

plt.figure(figsize = (20, 10)) # To resize the plot

sns.heatmap(df_hm,  fmt = "d", cmap = 'coolwarm', linewidths = .5, vmin = 0)

plt.show()

Observations:

  • The demand for Uber peaks during the late hours of the day, when people are returning home from the office.
  • Demand continues to be high around midnight on Fridays and Saturdays.
  • Oddly, demand on Monday evenings is not as high as on the other working days.

Let us see if a similar trend exists in Brooklyn¶

In [42]:
df_br = df[df.borough == 'Brooklyn']

df_hm = df_br.pivot_table(index = 'start_hour', columns = 'week_day', values = 'pickups')

# Draw a heatmap 
plt.figure(figsize = (20, 10)) # To resize the plot

sns.heatmap(df_hm,  fmt = "d", cmap = 'coolwarm', linewidths = .5, vmin = 0)

plt.show()
  • In Brooklyn, the trend of high Uber demand during the late hours of Fridays and Saturdays is less pronounced.

Conclusion and Recommendations¶


Conclusion¶


We analyzed a dataset of nearly 30K hourly records of Uber pickups across the New York boroughs. The data spanned every day of the first six months of 2015. The main feature of interest here is the number of pickups. From an environmental and business perspective, having cars roam one area while the demand is in another, or flooding the streets with cars during a low-demand period while falling short during peak hours, is inefficient. We therefore determined the factors that affect pickups and the nature of their effect.

We have been able to conclude that:

  1. Uber cabs are most popular in the Manhattan area of New York.
  2. Contrary to intuition, weather conditions do not have much impact on the number of Uber pickups.
  3. The demand for Uber has been increasing steadily over the months (Jan to June).
  4. The rate of pickups is higher on the weekends in comparison to weekdays.
  5. It is encouraging to see that New Yorkers trust Uber taxi services when they step out to enjoy their evenings.
  6. We can also conclude that people use Uber for regular office commutes. Demand increases steadily from 6 AM to 10 AM, declines a little, and then rises again until midnight, peaking at 7-8 PM.
  7. We need to further investigate the low demand for Uber on Mondays.

Recommendation to business¶


  1. Manhattan is the most mature market for Uber. Brooklyn, Queens, and Bronx show potential.
  2. There has been a gradual increase in Uber rides over the last few months, and we need to keep up the momentum.
  3. Ridership is high at peak office-commute hours on weekdays and during late evenings on Saturdays. Cab availability must be ensured during these times.
  4. The demand for cabs is the highest on Saturday nights. Cab availability must be ensured during this time of the week.
  5. Procure data on fleet-size availability to better understand the demand-supply balance, and build a machine learning model to accurately predict pickups per hour so the cab fleet can be optimized in the respective areas (a minimal baseline is sketched after this list).
  6. Procure more data on price and build a model that can predict optimal pricing.
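As a starting point for recommendation 5, a minimal baseline sketch using scikit-learn (assumed to be available). The feature set, model choice, and random train/test split are illustrative only; a time-based split would be more rigorous for a forecasting task:

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# One-hot encode the categoricals; keep a few weather and time features
X = pd.get_dummies(df[['borough', 'hday', 'start_month', 'week_day',
                       'start_hour', 'temp', 'pcp01', 'sd']], drop_first = True)
y = df['pickups']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    random_state = 42)

model = RandomForestRegressor(n_estimators = 100, random_state = 42)
model.fit(X_train, y_train)

print('MAE:', round(mean_absolute_error(y_test, model.predict(X_test)), 1))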

Further Analysis¶


  1. Dig deeper to explore the variation of cab demand between working and non-working days. Weekends and holidays can be combined into non-working days, with the remaining weekdays treated as working days (see the sketch below).
  2. Drop the boroughs that have negligible pickups and then re-analyze the data to uncover more insights.
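A sketch for point 1, flagging non-working periods (weekend or holiday) and comparing mean hourly pickups; the column names follow the analysis above:

In [ ]:
# True for weekends and holidays, False for regular weekdays
non_working = df['week_day'].isin(['Saturday', 'Sunday']) | (df['hday'] == 'Y')

# Mean hourly pickups for working (False) vs non-working (True) periods
df.groupby(non_working.rename('non_working'))['pickups'].mean()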