CIS7031 - Programming for Data Analysis
Student Name:
Student ID:
Table of Contents
2.1:-
2.2:-
2.3:-
4.1:-
4.2:-
5.1:-
5.2:-
6: Discussion
2.1:-
The required data frame was created for evaluation. Each row was then summed across the years, and the result was stored in a new column named ‘total’ that was added to the data frame.
As required in this section, I identified the industries with the highest and lowest numbers of employed workers over the period (see fig. below).
a. Public_Administration had the highest number of employed workers.
b. Real_Estate had the lowest number of employed workers.
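The row-total step described above can be sketched as follows. This is a minimal illustration, assuming a pandas data frame with an ‘Industry’ column and one column per year; the numbers are toy values chosen for the example, not the StatsWales figures, and only three year columns are shown.

```python
import pandas as pd

# Toy figures for illustration only; these are NOT the StatsWales values.
df = pd.DataFrame({
    "Industry": ["Public_Administration", "Retail", "Real_Estate"],
    "2009": [400, 200, 15],
    "2010": [395, 198, 16],
    "2018": [420, 199, 25],
})

# Every column except 'Industry' is treated as a year column.
year_cols = [c for c in df.columns if c != "Industry"]

# Sum across the year columns of each row and store the result in 'total'.
df["total"] = df[year_cols].sum(axis=1)

# idxmax/idxmin return the row labels of the largest and smallest totals.
highest = df.loc[df["total"].idxmax(), "Industry"]
lowest = df.loc[df["total"].idxmin(), "Industry"]
print(highest, lowest)
```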
2.2:-
To calculate the overall growth from 2009 to 2018, I computed the growth for each industry and stored it in a new column named ‘t_growth’ in the dataset. This column shows which industries had the highest and lowest growth over the period:
a. The highest overall growth is in the Real_Estate industry.
b. The lowest overall growth is in the Retail industry.
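One plausible way to compute such a growth column is sketched below. The report does not state the exact formula used, so the percentage change between 2009 and 2018 is an assumption here, and the numbers are toy values rather than the StatsWales figures.

```python
import pandas as pd

# Toy figures for illustration only; these are NOT the StatsWales values.
df = pd.DataFrame({
    "Industry": ["Public_Administration", "Retail", "Real_Estate"],
    "2009": [400, 200, 15],
    "2018": [420, 199, 25],
})

# Assumed definition: percentage change from the first to the last year.
df["t_growth"] = (df["2018"] - df["2009"]) / df["2009"] * 100

fastest = df.loc[df["t_growth"].idxmax(), "Industry"]
slowest = df.loc[df["t_growth"].idxmin(), "Industry"]
print(fastest, slowest)
```

Using a percentage rather than an absolute difference keeps small industries comparable with large ones, which may be why a small industry can show the highest growth.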
2.3:-
In this part, I found the best and worst performing years in terms of total employment by taking the sum of each year’s number of workers across all industries and then applying max and min. The results are:
a. The worst performing year, with the lowest employment, was 2010.
b. The best performing year, with the highest employment, was 2018.
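The yearly totals can be obtained by summing down each year column, as in the sketch below. The column layout is assumed to match the data frame described earlier, and the numbers are toy values, not the StatsWales figures.

```python
import pandas as pd

# Toy figures for illustration only; these are NOT the StatsWales values.
df = pd.DataFrame({
    "Industry": ["Public_Administration", "Retail", "Real_Estate"],
    "2009": [400, 200, 15],
    "2010": [395, 198, 16],
    "2018": [420, 199, 25],
})

year_cols = [c for c in df.columns if c != "Industry"]

# Sum down each year column to get total employment per year.
year_totals = df[year_cols].sum(axis=0)

best_year = year_totals.idxmax()   # year with the highest total
worst_year = year_totals.idxmin()  # year with the lowest total
print(best_year, worst_year)
```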
4.1:-
PCA is affected by scale, so the features in the data must be scaled before applying PCA. After scaling, I applied PCA with 2 components and obtained a data frame with the columns ‘Industry’, ‘PC 1’, and ‘PC 2’.
In this part, I got the following results after applying correlation:
1. With PC 1:
a. The highest correlated industry is -Public_Administration- with a value of 0.713494.
b. The lowest correlated industry is -Professional_Service- with a value of 0.046291.
2. With PC 2:
a. The highest correlated industry is -Professional_Service- with a value of 0.817254.
b. The lowest correlated industry is -Production- with a value of 0.023005.
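The scaling and projection steps can be sketched with scikit-learn as below. This is only an illustration on toy values (not the StatsWales figures), assuming the year columns are the features and each industry is one row; the industry names "A" to "D" are placeholders. The correlation step is not reproduced because the report does not specify how it was computed.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy figures for illustration only; these are NOT the StatsWales values.
df = pd.DataFrame({
    "Industry": ["A", "B", "C", "D"],
    "2009": [400, 200, 15, 90],
    "2010": [395, 198, 16, 95],
    "2018": [420, 199, 25, 120],
})
year_cols = ["2009", "2010", "2018"]

# Standardise each year column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df[year_cols])

# Project the scaled data onto the first two principal components.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Assemble a frame with the columns 'Industry', 'PC 1', 'PC 2'.
pca_df = pd.DataFrame(components, columns=["PC 1", "PC 2"])
pca_df.insert(0, "Industry", df["Industry"].values)
print(pca_df)
print(pca.explained_variance_ratio_)
```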
4.2:-
Are the aforementioned industries also correlated over each year?
Yes, the aforementioned industries are also correlated over each year. Correlation is computed between the numerical columns of a dataset, so the correlation results found between the industries (labelled) and each year are related to, and affected by, each other.
5.1:-
In this part we apply K-Means clustering under the given condition.
The condition is that K-Means clustering must be applied with K = 2 and K = 3 to a dataset containing only the Industry column and the columns for the best (2018) and worst (2010) performing years (from 2.3).
After some operations on the dataset, we obtained the required dataset, ready for clustering.
Before going further, here is a brief overview of K-Means clustering.
K-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms, and its main objective is “grouping similar data points and discover underlying patterns”.
To get the best results, the value of K is normally chosen first, using a method such as the “Elbow” method. In our problem, however, it is already specified that K must be 2 and 3.
Applying K = 2 and K = 3, we obtained the data grouped into 2 clusters and 3 clusters respectively.
KMeans is available in the scikit-learn library.
The graphs below show the clustered data.
[Figures: clustered data for K = 3 and K = 2]
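The clustering step for K = 2 and K = 3 can be sketched with scikit-learn's KMeans as below. The numbers are toy values (not the StatsWales figures), the industry names "A" to "F" are placeholders, and the `n_init` and `random_state` settings are assumptions made so the example is repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy figures for illustration only; these are NOT the StatsWales values.
df = pd.DataFrame({
    "Industry": ["A", "B", "C", "D", "E", "F"],
    "2010": [395, 380, 198, 190, 16, 20],
    "2018": [420, 400, 199, 205, 25, 22],
})

# Only the worst (2010) and best (2018) year columns are clustered.
X = df[["2010", "2018"]].to_numpy()

labels = {}
for k in (2, 3):
    # n_init and random_state are fixed here only for reproducibility.
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels[k] = model.fit_predict(X)
    df[f"cluster_k{k}"] = labels[k]

print(df)
```

Each industry receives one cluster label per value of K, which can then be used to colour a scatter plot of 2010 against 2018.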
5.2:-
Comparing K-Means with Hierarchical:
As seen in the above section, the two clustering methods give mostly similar results on the same dataset.
With a small dataset, the shapes of the clusters may differ a little.
However, despite their many similarities, the two techniques also have some differences.
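One way to quantify how similar the two clusterings are is to compare their label assignments, as in the sketch below. It uses toy values (not the StatsWales figures); the Ward linkage and the adjusted Rand index are assumptions chosen for illustration, since the report does not state which linkage or comparison measure was used.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Toy (2010, 2018) pairs for illustration only; NOT the StatsWales values.
X = np.array([
    [395, 420], [380, 400],
    [198, 199], [190, 205],
    [16, 25], [20, 22],
])

# K-Means with fixed settings for repeatability.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Agglomerative (hierarchical) clustering; Ward linkage is an assumption.
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Adjusted Rand index: 1.0 means the two partitions are identical,
# regardless of how the cluster labels happen to be numbered.
agreement = adjusted_rand_score(km_labels, hc_labels)
print(agreement)
```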
6: Discussion
The employment data for Wales were taken from the StatsWales data source, as required, since it provides the number of employed workers in each industry by year.
To do so, we first downloaded all the required datasets in the required form and then prepared them to build a dataset suitable for our computation.
The resulting dataset contains a total of 11 columns (Industry, 2009, 2010, 2011, …, 2018) and 10 rows (one per industry: Agriculture, Production, Construction, Retail, ICT, Finance, Real_Estate, Professional_Service, Public_Administration, and Other_Service).
I then checked whether the dataset has any null values and found that there are none.
With the data preparation done, we are ready to move further.
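A null check of this kind can be sketched in pandas as below. The data frame is a toy stand-in with the same layout (an ‘Industry’ column plus year columns), not the StatsWales data.

```python
import pandas as pd

# Toy figures for illustration only; these are NOT the StatsWales values.
df = pd.DataFrame({
    "Industry": ["Agriculture", "Production", "Construction"],
    "2009": [30, 150, 90],
    "2010": [31, 148, 88],
})

# Count missing values per column, then check whether any exist at all.
null_counts = df.isnull().sum()
has_nulls = bool(df.isnull().values.any())
print(null_counts)
print(has_nulls)
```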
Analyzing Data:
Each year column of the dataset holds numeric data, and the Industry column holds object-type data.
1. First we calculated the total number of employed workers in each industry over the years (from 2009 to 2018). Plotting these totals as a bar plot shows the total count for each industry. The industry with the most employees is “Public_Administration” and the one with the fewest is “Real_Estate”.
2. In the next part we calculated the overall growth of employed workers in each industry and found that the highest growth occurred in “Real_Estate” and the lowest growth occurred in “Retail”.
3. In the next part, we calculated the total number of employed workers in each year and found that the best performing year was 2018 and the worst performing year was 2010.
4. Next we visualized our dataset using Plotly and obtained bubble scatter plots for each year’s data.
5. PCA is affected by scale, so the features in the data must be scaled before applying PCA. After scaling, I applied PCA with 2 components and obtained a data frame with the columns ‘Industry’, ‘PC 1’, and ‘PC 2’.
In this part, I got the following results after applying correlation:
A. With PC 1:
a. The highest correlated industry is -Public_Administration- with a value of 0.713494.
b. The lowest correlated industry is -Professional_Service- with a value of 0.046291.
B. With PC 2:
a. The highest correlated industry is -Professional_Service- with a value of 0.817254.
b. The lowest correlated industry is -Production- with a value of 0.023005.
6. In the next part I found the correlation results between Industry and all the years (taking one year at a time).
7. In the next part we applied K-Means clustering and Hierarchical clustering to the columns 2010 and 2018 (from 2.3) and got the outcomes shown in the figures below.
K-Means:
Hierarchical:
As seen in the above section, the two clustering methods give mostly similar results on the same dataset.
With a small dataset, the shapes of the clusters may differ a little.
However, despite their many similarities, the two techniques also have some differences.
Some Screenshots of our Results: