Air Pollution Data Exploration¶


Context¶


Air pollution is all around us: indoors, outdoors, in cities, and in the countryside. It affects us all, whether we realize it or not. The rise in concentrations of various pollutants beyond permissible levels over the last few decades has driven pollution levels up, affecting weather conditions unfavorably and contributing to the formation of smog. Degraded air quality has a negative impact on individual health and can cause economic imbalance. Diseases caused by rising pollution levels are one of the major issues confronting urban settlements.

In this case study, we will explore and visualize the Air Pollution dataset, which contains readings of various pollutants along with related meteorological variables.


Objective¶


The objective of this problem is to reduce the number of features by using dimensionality reduction techniques such as PCA and t-SNE, and extract insights about the data.


Dataset¶


  • This dataset contains about 13 months of data on major pollutants and meteorological readings for a city.

  • The data shows average readings for Nitric Oxide (NO), Nitrogen Dioxide (NO2), Oxides of Nitrogen (NOx), Ozone, Particulate Matter (PM10 and PM2.5), Sulphur Dioxide, Benzene, Toulene, and P_Xylene, measured in micrograms per cubic meter of air (ug/m3). It also includes a few meteorological features: WindSpeed, VerticalWindSpeed, BarPressure, Weather, Temp, and WindDirection.

  • The prefix PD denotes Previous Day. The dataset has 5 columns containing the previous day's values of the corresponding columns: PD_PM10, PD_NO2, PD_SO2, PD_CO, and PD_PM2.5.

  • Date is the date on which the readings were recorded.

Importing libraries and overview of the dataset¶

In [1]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# To scale the data using z-score 
from sklearn.preprocessing import StandardScaler

# Importing PCA and t-SNE
from sklearn.decomposition import PCA

from sklearn.manifold import TSNE

Let's get an overview of the data and prepare it for visualization

In [2]:
# Loading the data
data_air_pol = pd.read_csv("AirPollution.csv")
In [3]:
data_air_pol.head()
Out[3]:
SrNo Date NO CO NO2 O3 SO2 PM2.5 Benzene Toulene ... WindSpeed VerticalWindSpeed Solar BarPressure Weather PD_PM2.5 PD_PM10 PD_NO2 PD_SO2 PD_CO
0 1 04-04-2015 7.22 1.77 47.94 51.07 16.88 48.99 2.53 9.65 ... 1.22 0.08 162.18 732.25 Summer NaN NaN NaN NaN NaN
1 2 05-04-2015 6.99 0.22 45.27 19.26 16.71 60.20 3.19 11.10 ... 0.62 -0.04 99.37 734.05 Summer 48.99 82.85 47.94 16.88 1.77
2 3 09-04-2015 7.60 0.50 59.86 94.29 13.11 46.93 2.29 8.61 ... 1.55 -0.17 146.94 728.08 Summer NaN NaN NaN NaN NaN
3 4 10-04-2015 7.57 0.77 63.56 66.91 16.19 112.95 3.92 10.76 ... 1.18 -0.15 150.07 730.47 Summer 46.93 171.36 59.86 13.11 0.50
4 5 11-04-2015 8.34 0.48 61.99 69.48 20.28 104.87 5.19 15.95 ... 0.88 0.15 137.01 730.62 Summer 112.95 232.22 63.56 16.19 0.77

5 rows × 27 columns

Checking the info of the data

In [4]:
data_air_pol.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 403 entries, 0 to 402
Data columns (total 27 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SrNo               403 non-null    int64  
 1   Date               403 non-null    object 
 2   NO                 401 non-null    float64
 3   CO                 402 non-null    float64
 4   NO2                401 non-null    float64
 5   O3                 397 non-null    float64
 6   SO2                399 non-null    float64
 7   PM2.5              401 non-null    float64
 8   Benzene            402 non-null    float64
 9   Toulene            402 non-null    float64
 10  P_Xylene           372 non-null    float64
 11  NOx                401 non-null    float64
 12  PM10               401 non-null    float64
 13  WindDirection      402 non-null    float64
 14  NH3                401 non-null    float64
 15  RH                 402 non-null    float64
 16  Temp               401 non-null    float64
 17  WindSpeed          402 non-null    float64
 18  VerticalWindSpeed  401 non-null    float64
 19  Solar              401 non-null    float64
 20  BarPressure        401 non-null    float64
 21  Weather            403 non-null    object 
 22  PD_PM2.5           393 non-null    float64
 23  PD_PM10            392 non-null    float64
 24  PD_NO2             391 non-null    float64
 25  PD_SO2             390 non-null    float64
 26  PD_CO              392 non-null    float64
dtypes: float64(24), int64(1), object(2)
memory usage: 85.1+ KB

Observations:

  • There are 403 observations and 27 columns in the data.
  • All the columns except Date and Weather are of numeric data type.
  • Weather is of object data type. We can create dummy variables for each category and convert it to numeric data type.
  • The majority of the columns have some missing values.
  • Let's check the number of missing values in each column.
In [5]:
data_air_pol.isnull().sum()
Out[5]:
SrNo                  0
Date                  0
NO                    2
CO                    1
NO2                   2
O3                    6
SO2                   4
PM2.5                 2
Benzene               1
Toulene               1
P_Xylene             31
NOx                   2
PM10                  2
WindDirection         1
NH3                   2
RH                    1
Temp                  2
WindSpeed             1
VerticalWindSpeed     2
Solar                 2
BarPressure           2
Weather               0
PD_PM2.5             10
PD_PM10              11
PD_NO2               12
PD_SO2               13
PD_CO                11
dtype: int64

Observations:

  • All the columns except SrNo, Date, and Weather have missing values.
  • We will impute (fill in) these missing values using the mode for the categorical variable (Weather) and the median for the numerical variables, as most of them are skewed; a quick skew check that could guide the choice between the mean and the median is sketched below.
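As a quick, optional check (a minimal sketch assuming the data loaded above; the 0.5 cutoff is just a rule of thumb chosen here for illustration), per-column skewness can guide the choice between mean and median imputation:

# A quick check of per-column skewness to guide the choice between mean and median imputation.
# Note: SrNo is still present at this point; it is dropped in the next step.
# The 0.5 cutoff is an arbitrary rule of thumb used here for illustration.
num_cols = data_air_pol.select_dtypes(include = "number").columns

for col in num_cols:

    skew = data_air_pol[col].skew()

    strategy = "median" if abs(skew) > 0.5 else "mean"

    print(f"{col}: skew = {skew:.2f} -> impute with the {strategy}")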

Data Preprocessing and Exploratory Data Analysis¶

In [6]:
# Dropping the columns - SrNo and Date
data_air_pol.drop(columns = ["SrNo", "Date"], inplace = True)

Summary Statistics¶

In [7]:
data_air_pol.describe().T
Out[7]:
count mean std min 25% 50% 75% max
NO 401.0 81.323292 97.204368 1.12 11.3800 39.220 118.0300 480.90
CO 402.0 2.210000 1.601589 0.22 1.2025 1.780 2.7375 19.90
NO2 401.0 74.433791 25.862319 25.27 54.5000 73.680 91.4300 149.01
O3 397.0 56.298917 29.690436 4.98 30.1200 53.210 76.8500 159.70
SO2 399.0 25.988496 25.273204 0.00 9.5150 21.680 37.1450 371.75
PM2.5 401.0 130.160673 84.196954 18.75 68.1100 108.240 166.1600 550.23
Benzene 402.0 6.525945 3.983177 0.16 3.2350 5.910 8.9400 17.51
Toulene 402.0 17.192488 10.410131 0.00 9.6950 16.240 22.9725 48.41
P_Xylene 372.0 8.128844 6.733961 0.30 3.3825 6.060 10.5700 32.35
NOx 401.0 194.681222 141.121196 26.49 87.6700 152.310 264.3800 743.70
PM10 401.0 265.254389 134.702840 36.63 163.6600 249.670 353.2700 705.70
WindDirection 402.0 202.851915 31.287309 126.69 180.4025 209.415 228.8675 258.06
NH3 401.0 39.630998 20.754359 1.00 25.9000 34.700 50.7200 144.82
RH 402.0 52.146990 15.878966 15.19 41.5750 53.870 63.8525 86.82
Temp 401.0 24.425686 7.419778 6.56 18.8200 26.570 29.8600 41.05
WindSpeed 402.0 1.238035 0.556081 0.30 0.8525 1.160 1.5200 3.45
VerticalWindSpeed 401.0 0.188853 0.785720 -2.87 -0.1700 -0.050 0.4000 3.56
Solar 401.0 112.954813 37.465048 3.31 84.9000 121.850 141.0300 345.68
BarPressure 401.0 732.560698 1.799439 721.67 731.6900 733.380 733.7600 736.08
PD_PM2.5 393.0 130.505445 84.786348 18.75 68.0500 108.240 170.9900 550.23
PD_PM10 392.0 265.622551 135.526761 36.63 163.3800 250.105 357.9550 705.70
PD_NO2 391.0 74.319003 25.996784 25.27 54.0300 73.520 91.3700 149.01
PD_SO2 390.0 25.964949 25.417022 0.00 9.4125 21.605 37.1975 371.75
PD_CO 392.0 2.200765 1.602474 0.37 1.2075 1.775 2.6675 19.90

Observations:

  • The typical acceptable level of PM10 is about 100 micrograms per cubic meter, and for PM2.5 it is about 60 micrograms per cubic meter. Here, PM10 has a high median of ~250 and PM2.5 a median of ~108, both well above these levels and harmful to the environment and people (a quick check of how often readings exceed these levels is sketched after this list).
  • The high levels of the different types of particulate matter indicate that the data might belong to an urban or industrial area.
  • The concentration of nitrogen dioxide (NO2) in the air should be in the range of 60-80 µg/m3. The mean NO2 concentration is ~74.4 µg/m3, which lies within this range, suggesting a comparatively smaller impact.
  • The standard deviation is very high for NO, NOx, PM10, and PM2.5, which indicates that pollution varies widely over the year (the data spans about 13 months).
  • The maximum values of many features reach alarmingly high levels, indicating there might be outliers in the data.
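As a rough cross-check of the particulate matter observation, the share of readings exceeding the quoted reference levels can be computed; a minimal sketch (the 100 and 60 ug/m3 cutoffs come from the observation above, not from verified standards):

# Share of recorded days where PM readings exceed the reference levels mentioned above.
pm_limits = {"PM10": 100, "PM2.5": 60}

for col, limit in pm_limits.items():

    share = (data_air_pol[col] > limit).mean()  # NaN readings count as not exceeding, so this is a lower bound

    print(f"{col}: {share:.1%} of days exceed {limit} ug/m3")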

Let's check the distribution and outliers for numerical columns in the data

In [8]:
result = data_air_pol.select_dtypes(include = 'number')

cont_cols = list(result)

for col in cont_cols:

    print(col)
    
    print('Skew :',round(data_air_pol[col].skew(),2))
    
    plt.figure(figsize = (15, 4))
    
    plt.subplot(1, 2, 1) #index of location of subplot
    
    data_air_pol[col].hist(bins = 10, grid = False)
    
    plt.ylabel('count')
    
    plt.subplot(1, 2, 2)
    
    sns.boxplot(x = data_air_pol[col])
    
    plt.show()
NO
Skew : 1.65
CO
Skew : 4.16
NO2
Skew : 0.35
O3
Skew : 0.53
SO2
Skew : 6.87
PM2.5
Skew : 1.45
Benzene
Skew : 0.71
Toulene
Skew : 0.61
P_Xylene
Skew : 1.55
NOx
Skew : 1.32
PM10
Skew : 0.61
WindDirection
Skew : -0.49
NH3
Skew : 1.34
RH
Skew : -0.36
Temp
Skew : -0.57
WindSpeed
Skew : 0.9
VerticalWindSpeed
Skew : 0.86
Solar
Skew : 0.15
BarPressure
Skew : -1.96
PD_PM2.5
Skew : 1.44
PD_PM10
Skew : 0.6
PD_NO2
Skew : 0.36
PD_SO2
Skew : 6.9
PD_CO
Skew : 4.26

Observations:

  • The distribution plots show that CO, SO2, PD_CO, PD_SO2, NO, NH3, NOx, P_Xylene, PM2.5, and PD_PM2.5 are strongly right-skewed. It is evident from the boxplots that all of these variables have outliers towards the right (the skew-based grouping is reproduced programmatically in the sketch below).
  • RH, Temp, and WindDirection are slightly left-skewed; however, their boxplots do not show any outliers. BarPressure is more strongly left-skewed (skew ≈ -1.96).
  • Benzene, Toulene, PM10, PD_PM10, VerticalWindSpeed, and WindSpeed seem to have a moderate right skew.
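The skew-based grouping above can be reproduced programmatically; a minimal sketch (the 1.0 and 0.5 cutoffs are arbitrary thresholds used here for illustration):

# Bucket numeric columns by skewness to support the observations above.
skews = data_air_pol.select_dtypes(include = "number").skew().sort_values(ascending = False)

print("Highly right-skewed (> 1):", skews[skews > 1].index.tolist())

print("Moderately right-skewed (0.5 to 1):", skews[(skews > 0.5) & (skews <= 1)].index.tolist())

print("Left-skewed (< -0.5):", skews[skews < -0.5].index.tolist())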

Now, let's check the correlation among different variables

In [9]:
plt.figure(figsize = (18, 12))

sns.heatmap(data_air_pol.corr(), annot = True, fmt = '0.2f')

plt.show()

Observations:

  • We can see a high positive correlation among the following variables:
    1. NO2 and PD_NO2
    2. PM2.5 and PD_PM2.5
    3. PM10 and PD_PM2.5
    4. PD_PM10 and PD_PM2.5

  • We can see a high negative correlation among the following variables:
    1. Temp and NO2
    2. VerticalWindSpeed and PD_SO2
    3. VerticalWindSpeed and PD_PM2.5
    4. WindDirection and VerticalWindSpeed
    5. SO2 and WindSpeed

  • Nitrogen dioxide (NO2) and fine particulate matter (PM2.5) are positively correlated with their previous-day values, i.e., PD_NO2 and PD_PM2.5, which makes sense.
  • Particulate matter of 2.5 micrometers in diameter (PM2.5) is positively correlated with particulate matter of 10 micrometers in diameter (PM10).
  • Temp and NO2 are negatively correlated, which makes sense, as NO2 concentrations in the air tend to rise when the temperature drops.
  • WindSpeed and VerticalWindSpeed are negatively correlated with pollutants like PD_SO2 and PD_PM2.5. This makes sense, as higher wind speeds generally disperse air pollutants more, resulting in lower pollution concentrations in areas with stronger winds (the strongest correlation pairs can be listed programmatically, as sketched below).
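To back this reading of the heatmap, the strongest pairwise correlations can be listed directly from the correlation matrix; a minimal sketch using the numeric columns of the data:

# List the strongest positive and negative correlation pairs.
corr = data_air_pol.select_dtypes(include = "number").corr()

# Keep only the upper triangle to drop self-correlations and duplicate pairs
pairs = corr.where(np.triu(np.ones(corr.shape, dtype = bool), k = 1)).stack()

print("Strongest positive correlations:")
print(pairs.sort_values(ascending = False).head(5))

print("Strongest negative correlations:")
print(pairs.sort_values().head(5))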

Data Preprocessing¶

In [10]:
# Imputing missing values with mode for the Weather column and with median for all other columns as majority of variables are skewed
for col in data_air_pol.columns:
    
    if col == "Weather":
        data_air_pol[col].fillna(value = data_air_pol[col].mode()[0], inplace = True)
    
    else:
        data_air_pol[col].fillna(value = data_air_pol[col].median(), inplace = True)
In [11]:
# Creating dummy variables for Weather column
data_air_pol = pd.get_dummies(data_air_pol, drop_first = True)

Scaling the data¶

In [12]:
scaler = StandardScaler()
data_air_pol_scaled = scaler.fit_transform(data_air_pol)
In [13]:
data_air_pol_scaled = pd.DataFrame(data_air_pol_scaled, columns = data_air_pol.columns)

Principal Component Analysis¶

In [14]:
# Defining the number of principal components to generate
n = data_air_pol_scaled.shape[1]

# Finding principal components for the data
pca1 = PCA(n_components = n, random_state = 1)
data_air_pol_pca = pd.DataFrame(pca1.fit_transform(data_air_pol_scaled))

# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
In [15]:
# Visualize the explained variance by individual components
plt.figure(figsize = (10, 10))

plt.plot(range(1, 29), pca1.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')

plt.title("Explained Variances by Components")

plt.xlabel("Number of Components")

plt.ylabel("Cumulative Explained Variance")

plt.show()
In [16]:
# Find the least number of components that can explain more than 70% variance (standard)
cum_var = 0

for ix, i in enumerate(exp_var1):
    
    cum_var = cum_var + i
    
    if cum_var > 0.70:
        print("Number of PCs that explain at least 70% variance: ", ix + 1)
        break
Number of PCs that explain at least 70% variance:  5
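The same count can also be obtained more concisely with NumPy; a minimal cross-check of the loop above:

# Smallest number of components whose cumulative explained variance exceeds 70%
n_pcs = np.argmax(np.cumsum(exp_var1) > 0.70) + 1

print("Number of PCs that explain at least 70% variance: ", n_pcs)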
In [17]:
# Making a new dataframe with first 5 principal components as columns and original features as indices
cols = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5']

pc1 = pd.DataFrame(np.round(pca1.components_.T[:, 0:5], 2), index = data_air_pol_scaled.columns, columns = cols)
pc1
Out[17]:
PC1 PC2 PC3 PC4 PC5
NO 0.25 -0.05 0.18 0.14 -0.13
CO 0.21 0.04 0.18 -0.00 -0.03
NO2 0.19 -0.22 0.18 0.06 0.24
O3 0.02 -0.38 -0.02 0.18 0.08
SO2 0.12 -0.19 -0.20 0.28 -0.11
PM2.5 0.26 -0.06 -0.10 -0.18 -0.14
Benzene 0.27 0.09 0.15 0.01 -0.01
Toulene 0.25 0.10 0.27 0.08 0.01
P_Xylene 0.25 0.07 0.22 0.03 -0.10
NOx 0.24 0.01 0.26 0.15 -0.12
PM10 0.23 -0.17 -0.10 -0.16 -0.20
WindDirection 0.09 -0.06 0.03 0.13 -0.56
NH3 0.24 0.04 -0.12 -0.08 0.11
RH 0.10 0.46 -0.02 0.01 0.18
Temp -0.21 -0.17 0.30 -0.06 -0.20
WindSpeed -0.20 -0.04 -0.27 -0.07 -0.07
VerticalWindSpeed -0.03 -0.22 0.28 -0.20 0.31
Solar -0.18 -0.22 0.14 0.11 -0.27
BarPressure 0.13 -0.01 -0.29 0.27 -0.06
PD_PM2.5 0.24 -0.05 -0.18 -0.24 -0.03
PD_PM10 0.22 -0.15 -0.19 -0.24 -0.04
PD_NO2 0.18 -0.23 0.08 -0.02 0.28
PD_SO2 0.13 -0.17 -0.16 0.24 -0.08
PD_CO 0.19 0.05 0.06 -0.13 0.07
Weather_Monsoon -0.10 0.36 0.15 0.08 -0.24
Weather_Spring 0.02 -0.04 -0.19 0.53 0.29
Weather_Summer -0.13 -0.33 0.02 -0.28 -0.05
Weather_Winter 0.17 0.12 -0.33 -0.25 -0.09

Observations:

  • The first principal component, PC1, seems to be related to hydrocarbons like Benzene, Toluene, and Xylene that are generated due to fuel combustion.
  • The second principal component, PC2, seems to be related to humidity (RH), Ozone level, and rain, i.e., the rainy season. Whenever humidity goes down, the Ozone level goes up - more sunshine. Higher humidity is an indication of rain which has the potential to wash away atmospheric pollution.
  • The third principal component, PC3, seems to be capturing air pressure - low pressure accompanies windy and rainy conditions, dispersing the pollution away. High pressure stills the wind allowing pollution to build up in urban areas.
  • The fourth principal component, PC4, seems to be associated with high values of toxic gases like SO2 and is also related to the weather.
  • The fifth principal component, PC5, seems to capture the direction of the wind, which affects air pollution irrespective of the concentration of the pollutants: if the wind is blowing towards an urban area from an industrial area, then pollution levels are likely to be higher in the town or city than if the air is blowing from another direction, for example, open farmland. (The loadings behind these interpretations can be listed programmatically, as sketched below.)
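These interpretations can be sanity-checked by listing the features with the largest absolute loadings on each component; a minimal sketch using the pc1 dataframe built above:

# Top 5 features by absolute loading on each of the first 5 principal components
for pc in pc1.columns:

    top_features = pc1[pc].abs().sort_values(ascending = False).head(5)

    print(pc, "->", list(top_features.index))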

t-SNE¶

Let's try to visualize the data for different perplexity values

In [18]:
# Visualizing the t-SNE embeddings for a range of perplexity values
for i in range(10, 50, 5):
    tsne = TSNE(n_components = 2, random_state = 1, perplexity = i)
    
    data_air_pol_tsne = tsne.fit_transform(data_air_pol_scaled)
    
    data_air_pol_tsne = pd.DataFrame(data_air_pol_tsne)
    
    data_air_pol_tsne.columns = ['X1', 'X2']
    
    plt.figure(figsize = (7, 7))
    
    sns.scatterplot(x = 'X1', y = 'X2', data = data_air_pol_tsne)
    
    plt.title("perplexity = {}".format(i))
    
    plt.show()

Observations:

  • We observe that some perplexity values like 35 and 45 can capture the underlying patterns in the data better than other values.
  • This shows that perplexity plays an important role in t-SNE implementation.
  • Let's visualize again with perplexity equal to 35, as it shows 4 clear groups that are distant from each other, i.e., well separated.
In [19]:
# Fitting t-SNE with number of components equal to 2
tsne = TSNE(n_components = 2, random_state = 1, perplexity = 35)

data_air_pol_tsne = tsne.fit_transform(data_air_pol_scaled)

# Converting the embeddings to a dataframe
data_air_pol_tsne = pd.DataFrame(data_air_pol_tsne, columns = ["X1", "X2"])

# Scatter plot for two components
plt.figure(figsize = (7, 7))

sns.scatterplot(x = 'X1', y = 'X2', data = data_air_pol_tsne)

plt.show()

Observations:

  • We can clearly see 4 groups in the data.
  • Let's label these 4 groups using the values of the X1 and X2 axes.
In [20]:
# Let's assign points to 4 different groups
# Note: the X1/X2 cut-offs below were read off the scatter plot above and may differ for other t-SNE runs
def grouping(x):
    first_component = x['X1']
    
    second_component = x['X2']
    
    if second_component > 12:
        return 'group_1'
    
    elif (second_component < -10) and (first_component > -10): 
        return 'group_2'
    
    elif (second_component < 12) and (first_component > -9):
        return 'group_3'
    
    else: 
        return 'group_4'
In [21]:
data_air_pol_tsne['groups'] = data_air_pol_tsne.apply(grouping, axis = 1)
In [22]:
# Scatter plot for two components with hue
plt.figure(figsize = (7, 7))

sns.scatterplot(x = 'X1', y = 'X2', data = data_air_pol_tsne, hue = 'groups')
Out[22]:
<AxesSubplot:xlabel='X1', ylabel='X2'>

Now, let's try to visualize the properties of these four groups with respect to all the numeric variables in the data.

In [23]:
all_col = data_air_pol_scaled.columns[:-4].tolist()

plt.figure(figsize = (30, 50))

for i, variable in enumerate(all_col):
    plt.subplot(8, 3, i + 1)
    
    sns.boxplot(y=data_air_pol_scaled[variable], x=data_air_pol_tsne['groups'])
    
    plt.tight_layout()
    
    plt.title(variable)

plt.show()

Observations:

  • There are four groups in the data. Each group has a different set of characteristics (group-wise median profiles that back these descriptions are sketched after this list).

  • Group 1 represents hot and humid areas. As there is less variability in the pressure, the wind speed is low. Higher humidity is an indication of rain which has the potential to wash away atmospheric pollution, resulting in low levels of pollutants - fine particulate matter and hydrocarbons.

  • Group 2 represents areas with medium humidity and temperature, and low pressure and wind speed. The level of pollutants is medium. These might be developing urban areas where the Ozone (O3) concentration is medium to high, consistent with the comparatively higher solar radiation.

  • Group 3 is the largest and represents low-humidity and high-temperature areas. As the variability in pressure is high, the wind speed is high. If the wind is blowing towards an urban area from an industrial area then pollution levels are likely to be higher in the town or city than if the air is blowing from another direction, for example, open farmland. These might be developed urban areas as they have the highest level of Ozone pollutant and solar radiation, which could be a result of pollutants emitted by cars, power plants, etc.

  • Group 4 represents industrial areas as it has a very high concentration of hydrocarbons, which are generally a result of fuel combustion. These areas have a high level of fine particulate matter and other pollutants. As there is less variability in air pressure, the wind speed is low.
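A group-wise numeric summary can back these qualitative descriptions; a minimal sketch that computes the median scaled value of each feature per group (using the groups labels created above):

# Median of each scaled feature per t-SNE group; values are z-scores,
# so a positive median means "above the overall average" for that feature.
profile = data_air_pol_scaled.copy()

profile["groups"] = data_air_pol_tsne["groups"].values

group_medians = profile.groupby("groups").median().T

print(group_medians.round(2))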