ETW2001 2022 S2 Assessment 2
Individual Assignment (15%)
The main task of the assignment is for students to perform analysis using fundamental skills of R programming. This assignment requires coding skills from week 1 to week 6.
Deadline: Week 7, 9 September 2022 (Friday) by 11:55 PM in MYT
Submission: A report in pdf format and an R-script that records all codes for data preparation and output.
Note
• When you submit in Moodle, it might show an error saying, “You must upload a supported file type for this assignment. Accepted file types are; …”. You can ignore this message as long as you uploaded the R-script.
• Do not take screenshots of your output. Export plots properly. For tables, you can use the table function in Microsoft Word. Some packages can create tables in RStudio, but they require considerable extra work; you may try them if you want.
• Since this assignment does not require much writing for the report, you will spend most time searching and writing codes. I encourage you to discuss task 1 via the Discord channel.
• This assignment covers the Unit’s Learning outcomes 1 and 2.
• For 1.7 and 1.8, you might need to refer to Week 6 lecture material.
• You do not need to copy and paste the codes into the report since they are recorded in the R-script. Your report should include plots and analysis.
Context
Population data is valuable as it affects several aspects of the sustainable growth of countries. Analysis and estimation of population allow us to manage global challenges such as poverty, starvation, distribution of commodities, energy, environment, and our well-being. This can be useful for multinational firms in that observing demographic changes provides an insight into future market size.
For this assignment, you will begin by tidying up the data and then analyse multiple variables related to the population. The data cover the years 2001 to 2050; future values are estimates by the World Bank.
Data source: https://databank.worldbank.org/source/population-estimates-and-projections
Task 1 – Data preparation (33 marks)
The answer for this task should be written in the R-script.
1.1 Load the “data” sheet from the Excel file (Assessment 2 data) into RStudio and assign it to a data frame called “df”. Load the packages that you use in the assignment. Ensure the first row is used as the column names. (2 marks)
1.2 Using the dplyr functions, delete the “Series code” column from “df”. Save the output as “df1” (2m)
1.3 Using the dplyr functions, convert the “df1” data for columns from 2001 to 2050 to numeric. Save the output as “df2” (dim = 49747, 53). You might see a message saying, “There were 50 or more warnings”. Do not worry about it. (2m)
1.4 Using the dplyr functions, convert the “df2” data for columns from 2001 to 2050 to take 2 decimal places. Save the output as “df3”. (2m)
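Steps 1.2 to 1.4 can be sketched on a toy data frame as follows. This is only an illustration: the toy values and the exact spelling `Series Code` are assumptions, so check the column names in your own file.

```r
library(dplyr)

# Toy stand-in for the spreadsheet (assumed values); the real "df" comes from the Excel file
df <- tibble(
  `Country Name`  = c("A", "B"),
  `Country Code`  = c("AAA", "BBB"),
  `Series Name`   = c("Population, total", "Population, total"),
  `Series Code`   = c("SP.POP.TOTL", "SP.POP.TOTL"),
  `2001 [YR2001]` = c("10.454", ".."),
  `2002 [YR2002]` = c("11.238", "5.5")
)

# 1.2: drop the series code column
df1 <- df %>% select(-`Series Code`)

# 1.3: convert every column whose name starts with four digits to numeric;
# non-numeric entries such as ".." become NA and trigger coercion warnings
df2 <- df1 %>% mutate(across(matches("^[0-9]{4}"), as.numeric))

# 1.4: round the same columns to 2 decimal places
df3 <- df2 %>% mutate(across(matches("^[0-9]{4}"), ~ round(.x, 2)))
```

Selecting the year columns by pattern avoids typing 50 column names and keeps the three steps consistent with each other.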
1.5 Change the format of column names as below:
Convert the “df3” column names from “2001 [YR2001]” to “2001”. Apply this to the rest of the columns up to 2050. Make sure you do not repeat the codes 50 times. You need two sets of codes. One for the new set of column names and the other one to apply as new column names. (5m)
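One possible way to do the renaming in two steps is shown below on a toy data frame (assumed values). It uses a regular expression to strip the " [YRxxxx]" suffix, which is just one of several workable approaches.

```r
library(dplyr)

# Toy stand-in for df3 (assumed values)
df3 <- tibble(
  `Country Name`  = "A",
  `2001 [YR2001]` = 1.5,
  `2002 [YR2002]` = 2.5
)

# Step 1: build the new set of names by removing the trailing " [YRxxxx]"
new_names <- sub(" \\[YR[0-9]{4}\\]$", "", names(df3))

# Step 2: apply the new names to the data frame
names(df3) <- new_names
```

Names that do not end in the " [YRxxxx]" pattern, such as `Country Name`, are left unchanged.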
1.6 Drop all NA values from “df3”. Save this output as “df4”. The dimension of df4 should be 33491 by 53. (2m)
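A minimal sketch of dropping incomplete rows, assuming the tidyr package is used (base R `na.omit()` would be an alternative). The toy data are invented for illustration.

```r
library(tidyr)

# Toy stand-in for df3 (assumed values)
df3 <- data.frame(
  country = c("A", "B", "C"),
  `2001`  = c(1, NA, 3),
  `2002`  = c(4, 5, NA),
  check.names = FALSE
)

# Keep only rows with no NA in any column
df4 <- drop_na(df3)
```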
1.7 There are too many columns, one for each year. Put all year column names from “df4” under the “Year” column and put their values under the “Values” column. Also, convert “Year” to numeric values. Save this output as “df5”. The dimension of “df5” is 1674550 by 5. (4m)
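The wide-to-long reshaping in 1.7 can be sketched with `tidyr::pivot_longer()`. The toy data below are assumed; only the column names `Year` and `Values` come from the task.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for df4 (assumed values)
df4 <- tibble(
  `Country Name` = c("A", "B"),
  `Country Code` = c("AAA", "BBB"),
  `Series Name`  = c("Population, total", "Population, total"),
  `2001` = c(10, 20),
  `2002` = c(11, 21)
)

df5 <- df4 %>%
  pivot_longer(
    cols      = matches("^[0-9]{4}$"),  # all four-digit year columns
    names_to  = "Year",
    values_to = "Values"
  ) %>%
  mutate(Year = as.numeric(Year))       # year names arrive as character
```

Each input row becomes one row per year, which is why the row count in the real data grows by a factor of 50.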
1.8 There are too many rows for each country. Put all “Series name” from “df5” to separate columns following its name. Save this output as “df6”; its dimension is 12950 by 156. (4m)
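The long-to-wide step in 1.8 can be sketched with `tidyr::pivot_wider()`. The toy data are assumed for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for df5 (assumed values): two series for one country over two years
df5 <- tibble(
  `Country Name` = rep("A", 4),
  `Country Code` = rep("AAA", 4),
  `Series Name`  = rep(c("Population, total",
                         "Urban population (% of total population)"), each = 2),
  Year   = rep(c(2001, 2002), times = 2),
  Values = c(10, 11, 40, 41)
)

# Each distinct series name becomes its own column
df6 <- df5 %>%
  pivot_wider(names_from = `Series Name`, values_from = Values)
```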
1.9 Since there are 156 columns, it would be difficult to identify variable names whenever you want to perform further analysis. Create a data frame called “index” to show the list of variable names from “df6”. (2m)
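One simple way to build such a lookup table is sketched below. The toy `df6` and the column names of `index` (`position`, `variable`) are assumptions.

```r
library(dplyr)

# Toy stand-in for df6 (assumed values)
df6 <- tibble(
  `Country Name` = "A",
  `Country Code` = "AAA",
  Year = 2001,
  `Population, total` = 1000
)

# One row per column of df6: its position and its name
index <- data.frame(
  position = seq_along(names(df6)),
  variable = names(df6)
)
```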
1.10 Using the index number, create “df7”, including the following: (2m)
• Country Name
• Country Code
• Year
• Age dependency ratio (% of working-age population)
• Population ages 65 and above, female (% of female population)
• Population ages 65 and above, male (% of male population)
• Population, female (% of total population)
• Population, total
• Rural population (% of total population)
• Urban population (% of total population)
1.11 Rename the above variables from “df7” as below: (2m)
Original variable name | New name
Country Name | Use as it is
Country Code | Use as it is
Year | Use as it is
Age dependency ratio (% of working-age population) | dep_ratio
Population ages 65 and above, female (% of female population) | pop65f
Population ages 65 and above, male (% of male population) | pop65m
Population, female (% of total population) | popf
Population, total | pop
Rural population (% of total population) | pop_rural
Urban population (% of total population) | pop_urban
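Selecting by position and then renaming can be sketched as below. The toy `df6` and the positions 5 and 6 are assumptions that apply only to this example; look up the real positions in your own “index”.

```r
library(dplyr)

# Toy stand-in for df6 (assumed values and column order)
df6 <- tibble(
  `Country Name` = "A",
  `Country Code` = "AAA",
  Year = 2001,
  `Some other series` = 1,
  `Population, total` = 1000,
  `Rural population (% of total population)` = 60
)

# 1.10: keep columns by their index numbers
df7 <- df6 %>% select(1, 2, 3, 5, 6)

# 1.11: rename() takes new_name = old_name pairs
df7 <- df7 %>%
  rename(
    pop       = `Population, total`,
    pop_rural = `Rural population (% of total population)`
  )
```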
1.12 Using the dplyr functions, calculate population growth (popgr). You may refer to the formula below:
$$\mathit{popgr}_t = \frac{\mathit{pop}_t - \mathit{pop}_{t-1}}{\mathit{pop}_{t-1}} \times 100$$
Create “df8” that includes popgr and all the variables from “df7”. It is important to group the data by country, then to calculate popgr. (4m)
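A minimal sketch of the grouped growth calculation, using `dplyr::lag()` for the previous year's population. The toy data are assumed. Grouping matters because, without it, the first year of one country would be compared with the last year of the previous country.

```r
library(dplyr)

# Toy stand-in for df7 (assumed values)
df7 <- tibble(
  `Country Name` = rep(c("A", "B"), each = 3),
  Year = rep(2001:2003, times = 2),
  pop  = c(100, 110, 121, 200, 210, 189)
)

df8 <- df7 %>%
  group_by(`Country Name`) %>%
  arrange(Year, .by_group = TRUE) %>%              # ensure years are in order within each country
  mutate(popgr = (pop - lag(pop)) / lag(pop) * 100)
```

The first year of each country has no previous value, so its `popgr` is `NA`.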
Task 2 – Population around the world (43 marks)
You will be using “df8” for Task 2 and Task 3.
2.1 Create a plot that shows changes in population growth from 2002 to 2022. Your plot must include the following aspects: (10m)
• Use data for 5 income categories (Low, Lower middle, Middle, Upper middle and High income) which can be found in `Country Name`
• Assign 5 different colours for these 5 categories.
• The line width is 1.2.
• Display the y-axis from 0 to 3 and the breaks by 0.5.
• Display the x-axis from 2002 to 2022 and the breaks by 2.
• Add appropriate title, subtitle and axis labels.
• Locate the legend at the bottom of the plot. The legend items should fit nicely.
• You may add an extra theme to make your plot more aesthetic (extra marks not granted)
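The plotting requirements above can be sketched with ggplot2 as follows. The numbers are invented purely so the example runs, the category labels (for example "Low income") are assumed spellings that should be checked against `Country Name`, and the `linewidth` argument assumes ggplot2 3.4.0 or later (older versions use `size` for lines).

```r
library(dplyr)
library(ggplot2)

# Assumed spellings of the five income categories
income_groups <- c("Low income", "Lower middle income", "Middle income",
                   "Upper middle income", "High income")

# Toy stand-in for df8; the growth figures are made up
df8 <- tibble(
  `Country Name` = rep(income_groups, each = 21),
  Year  = rep(2002:2022, times = 5),
  popgr = rep(c(2.7, 1.6, 1.3, 0.8, 0.6), each = 21) -
          0.02 * (rep(2002:2022, times = 5) - 2002)
)

plot_data <- df8 %>%
  filter(`Country Name` %in% income_groups, Year >= 2002, Year <= 2022)

p <- ggplot(plot_data, aes(x = Year, y = popgr, colour = `Country Name`)) +
  geom_line(linewidth = 1.2) +
  scale_y_continuous(limits = c(0, 3), breaks = seq(0, 3, by = 0.5)) +
  scale_x_continuous(limits = c(2002, 2022), breaks = seq(2002, 2022, by = 2)) +
  labs(
    title    = "Population growth by income group, 2002 to 2022",
    subtitle = "Toy data for illustration",
    x = "Year", y = "Population growth (%)", colour = NULL
  ) +
  guides(colour = guide_legend(nrow = 2)) +   # wrap legend items onto two rows
  theme(legend.position = "bottom")
```

Mapping `colour` to the category gives five distinct default colours; `scale_colour_manual()` could be added if specific colours are preferred.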
2.2 Analyse the plot you created. Based on plot 2.1, forecast (without any calculation) how the population growth will be in the future. Your analysis should not exceed 1 page. (10m)
2.3 Replicate the plot you created in 2.1 but covering a time span from 2022 to 2050. Make a necessary adjustment for the title and scale (the x-axis breaks by 5). (3m)
2.4 Does the plot from 2.3 match your forecast from 2.2? Discuss any observed differences to your forecast. Provide possible economic reasons for any mismatch. Your analysis should not exceed half a page. (3m)
2.5 Create a plot that shows changes in dependency ratio from 2002 to 2022. Your plot must include the following aspects: (3m)
• Use data for 5 income categories (Low, Lower middle, Middle, Upper middle and High income).
• Assign 5 different colours for these 5 categories.
• The line width is 1.2.
• Display the y-axis from 40 to 100 and the breaks by 10.
• Display the x-axis from 2002 to 2022 and the breaks by 2.
• Add appropriate title, subtitle and axis labels.
• Locate the legend at the bottom of the plot. The legend items should fit nicely.
• You may add an extra theme to make your plot more aesthetic (extra marks not granted)
2.6 Analyse the plot you created. Based on plot 2.5, forecast (without any calculation) how the dependency ratio will be in the future. Your analysis should not exceed 1 page. (6m)
2.7 Replicate the plot you created in 2.5 but covering a time span from 2022 to 2050. Make a necessary adjustment for the title and scale (the x-axis breaks by 5). (3m)
2.8 Does the plot from 2.7 match your forecast from 2.6? Discuss any observed differences to your forecast. Discuss why the predicted data by the World Bank shows such patterns for each income category. Your analysis should not exceed 1 page. (5m)
Task 3 – Mini research (24 marks)
Obtain a random country name by using your student ID as follows:
set.seed(your_student_ID)  # replace your_student_ID with your numeric student ID
sample(df8$`Country Name`, size = 1)
Using the code above, you will obtain a random country’s name from the ‘Country Name’ variable. It is okay if your output shows a region or group of countries.
3.1 Create a plot that includes 4 line graphs within. Include the following details (5m):
• A country or region that is selected by random sampling.
• 4 plots for Population, Population growth rate, Dependency Ratio, and Urban population.
• Since the population scale is too large relative to the other variables, express the population in units of millions. You are allowed to use billions if you obtained a region instead of a country.
• Label each y-axis.
• The x-axis should indicate years less than or equal to 2020.
• The main title should be “Population in ‘country’s name’”.
Before you filter the selected country, you should ungroup “df8” first.
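One possible layout for the four-panel figure is sketched below: the four variables are stacked into long form and drawn with `facet_wrap()`, using the facet strips as y-axis labels. This is only one approach among several (separate plots arranged in a grid would also work). The country name "Examplestan", the toy numbers, and the toy column names are all hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy grouped stand-in for df8; every value is invented
df8 <- tibble(
  `Country Name` = rep(c("Examplestan", "Malaysia"), each = 4),
  Year      = rep(c(2018, 2019, 2020, 2021), times = 2),
  pop       = c(5.0e6, 5.1e6, 5.2e6, 5.3e6, 3.20e7, 3.25e7, 3.30e7, 3.35e7),
  popgr     = c(NA, 2.00, 1.96, 1.92, NA, 1.56, 1.54, 1.52),
  dep_ratio = c(60, 59, 58, 57, 44, 44, 43, 43),
  pop_urban = c(40, 41, 42, 43, 76, 77, 77, 78)
) %>%
  group_by(`Country Name`)

country <- "Examplestan"   # hypothetical result of the random draw

# Panel labels double as y-axis labels
panel_labels <- c(
  pop_million = "Population (millions)",
  popgr       = "Population growth (%)",
  dep_ratio   = "Dependency ratio (%)",
  pop_urban   = "Urban population (%)"
)

plot_data <- df8 %>%
  ungroup() %>%                                   # ungroup before filtering
  filter(`Country Name` == country, Year <= 2020) %>%
  mutate(pop_million = pop / 1e6) %>%
  select(Year, pop_million, popgr, dep_ratio, pop_urban) %>%
  pivot_longer(-Year, names_to = "variable", values_to = "value") %>%
  mutate(variable = unname(panel_labels[variable]))

p <- ggplot(plot_data, aes(x = Year, y = value)) +
  geom_line(na.rm = TRUE) +
  facet_wrap(~ variable, scales = "free_y", strip.position = "left") +
  labs(title = paste("Population in", country), x = "Year", y = NULL) +
  theme(strip.placement = "outside")
```

Free y-scales let each variable use its own range, which is the reason for converting population to millions rather than forcing all four onto one axis.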
3.2 Analyse the plot you created in 3.1. If you find any unusual changes, search for possible reasons with proper citation. Do not write more than a page. (8m)
3.3 Repeat 3.1, include the randomly selected country/region and “Malaysia”. For each graph, two lines should indicate the randomly selected country/region and Malaysia respectively. Set all other details the same with a proper title and legend. (3m)
3.4 Considering the economic development level, compare and analyse each graph. A good analysis should include insight with sound reasoning. Do not write more than a page. (8m)