This is for R programming involving statistics. Please provide the solution in .ipynb format. Please ensure you follow the same format, function names, variable names, etc. as shown for each question in the attached PDF file.
Part 1: Linear Regression
After 1000 years, due to human activities, the Earth's ecology has been irreversibly destroyed; hence, the earthlings seek to emigrate to a new Earth-like planet, Kepler 22b. Kepler 22b (also known as Kepler Object of Interest KOI-087.01) is an exoplanet that orbits in the habitable zone of the Sun-like star Kepler 22. It is located in the constellation Cygnus, about 600 light years (180 pc) from Earth, and was discovered by NASA's Kepler Space Telescope in December 2011.
In 3011, humans discovered that this planet has a composition similar to that of the Earth, including the water on which all animals and plants depend. Later, scientists conducted many field studies and proposed that we could migrate to this planet, with the migration process starting as soon as 3030. Humanity has since built many cities on Kepler 22b and completed the migration process by 3100 using technologically advanced ultra-high-speed space shuttles.
Suppose that you live in 4118 and are one of the many data scientists working on the Kepler 22b climate problem. After many expansions on the planet, mankind wants to prevent Kepler 22b from repeating the same mistakes made on Earth; thus, a tremendous amount of resources has been put towards the study of the sulfur dioxide (SO2) level on Kepler 22b, because high SO2 levels cause acid rain.
You have been provided with three datasets: Regression_train.csv, Regression_test.csv, and Regression_new.csv. Using these datasets, you hope to build a model that can predict the SO2 level on this new planet using other variables.
Regression_train.csv and Regression_new.csv come with the ground-truth target label SO2, whereas Regression_test.csv comes with independent variables (input information) only.
Information about the attributes in these datasets can be found below:
year: year of data in this row
month: month of data in this row
day: day of data in this row
hour: hour of data in this row
PM2.5: PM2.5 concentration (ug/m^3)
PM10: PM10 concentration (ug/m^3)
NO2: NO2 concentration (ug/m^3)
CO: CO concentration (ug/m^3)
TEMP: temperature (degree Celsius).
O3: O3 concentration (ug/m^3)
PRES: pressure (hPa)
DEWP: dew point temperature (degree Celsius)
RAIN: precipitation (mm)
wd: wind direction
WSPM: wind speed (m/s)
station: name of the air-quality monitoring site
SO2: SO2 concentration (ug/m^3)
PLEASE NOTE THAT THE USE OF LIBRARIES IS PROHIBITED IN THESE QUESTIONS UNLESS STATED OTHERWISE; ANSWERS USING LIBRARIES WILL RECEIVE 0 MARKS.
Question 1
Please load Regression_train.csv and fit a multiple linear regression model with SO2 as the target variable. According to the summary table, which predictors do you think are possibly associated with the target variable (use a significance level of 0.01), and which are the Top 10 strongest predictors? Please write an R script to automatically fetch and print this information.
NOTE: Manually doing the above tasks will result in 0 marks.
In [ ]: # ANSWER BLOCK
# Read in the data here
train <- ...

# Build the multiple linear regression model here
lm.fit <- ...

# Get the summary of the model here
fit.summary <- ...

# Write the function to get the important predictors as well as the top 10 strongest
top.predictors <- function(fit.summary){
  # Getting the important predictors
  coef.imp <- ...

  # Getting the top 10 predictors
  coef.most <- ...

  # Printing out the results; you can keep this format or use a format that looks better
  print(paste("The important features are: ", ...))
  print(paste("The top 10 most important features are: ", ...))
}
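Below is a minimal sketch of one way this block could be completed. It assumes the coefficient table returned by summary(), judges association by the p-value column (Pr(>|t|) < 0.01), and ranks "strength" by the absolute t-statistic; those last two choices are assumptions, not requirements stated in the question. Note that factor predictors such as wd and station appear as one coefficient row per dummy level.

# Sketch only: significance taken as p-value < 0.01, "strength" ranked by |t value|
train <- read.csv("Regression_train.csv")
lm.fit <- lm(SO2 ~ ., data = train)
fit.summary <- summary(lm.fit)

top.predictors <- function(fit.summary){
  coefs <- fit.summary$coefficients
  coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]

  # Predictors with a p-value below the 0.01 significance level
  coef.imp <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.01]

  # Top 10 predictors by absolute t-statistic
  coef.most <- rownames(coefs)[order(abs(coefs[, "t value"]), decreasing = TRUE)][1:10]

  print(paste("The important features are:", paste(coef.imp, collapse = ", ")))
  print(paste("The top 10 most important features are:", paste(coef.most, collapse = ", ")))
}

top.predictors(fit.summary)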
Question 2
Rather than calling the lm() function, you would like to write your own function to do the least squares estimation for the simple linear regression model parameters β0 and β1. The function takes two input arguments, the first being the dataset name and the second the predictor name, and outputs the fitted linear model in the form:
E[SO2] = β̂0 + β̂1 x
Code up this function in R and apply it to the two predictors TEMP and CO separately, and explain the effect that those two variables have on SO2.
In [ ]: # ANSWER BLOCK
# Least squares estimator function
lsq <- function(dataset, predictor){
  # INSERT YOUR ANSWER IN THIS BLOCK

  # Get the final estimators
  beta_1 <- ...
  beta_0 <- ...

  # Return the results:
  return(paste0("E[SO2]=", beta_0, "+", beta_1, "*", predictor))
}

print(lsq(train, "TEMP"))
print(lsq(train, "CO"))
ANSWER (TEXT)
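For reference, a minimal sketch of the closed-form least squares estimates is shown below. It assumes the standard simple-regression formulas beta_1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2) and beta_0 = mean(y) - beta_1 * mean(x), with the predictor passed as a column name (string).

# Sketch only: closed-form least squares estimates for a simple linear regression,
# with SO2 as the response and `predictor` given as a column name (string)
lsq <- function(dataset, predictor){
  x <- dataset[[predictor]]
  y <- dataset$SO2

  # Slope and intercept from the usual least squares formulas
  beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  beta_0 <- mean(y) - beta_1 * mean(x)

  return(paste0("E[SO2]=", beta_0, "+", beta_1, "*", predictor))
}

print(lsq(train, "TEMP"))
print(lsq(train, "CO"))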
Question 3
The R squared from the summary table reflects that the full model doesn't fit the training dataset well; thus, you try to quantify the error between the ground-truth values and the model predictions. You want to write a function that predicts SO2 for a given dataset and calculates the root mean squared error (rMSE) between the model predictions and the ground truths. Please test this function on the full model and the training dataset.
In [ ]: # ANSWER BLOCK
rmse <- function(model, dataset){
  return(...)
}
print(rmse(...))
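A minimal sketch, assuming rMSE is defined as the square root of the mean squared difference between the model predictions and the SO2 column of the supplied dataset:

# Sketch only: rMSE = sqrt(mean((ground truth - prediction)^2))
rmse <- function(model, dataset){
  preds <- predict(model, dataset)
  return(sqrt(mean((dataset$SO2 - preds)^2)))
}

print(rmse(lm.fit, train))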
Question 4
You find the full model complicated and try to reduce the complexity by performing bidirectional stepwise regression with BIC. Please briefly explain what forward stepwise regression does in less than 100 words.
Calculate the rMSE of this new model with the function that you implemented previously.
In [ ]: # ANSWER BLOCK
sw.fit <- step(lm.fit, ...)
summary(sw.fit)
print(rmse(...))
ANSWER (TEXT)
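A minimal sketch, assuming the BIC penalty is obtained by passing k = log(n) to step() and the bidirectional search by direction = "both"; trace = 0 simply suppresses the step-by-step printout.

# Sketch only: bidirectional stepwise selection with a BIC penalty (k = log(n))
sw.fit <- step(lm.fit, direction = "both", k = log(nrow(train)), trace = 0)
summary(sw.fit)
print(rmse(sw.fit, train))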
Question 5
You have been given a new dataset, Regression_new.csv, that contains the latest data from Kepler 22b. You are going to apply the new model sw.fit to the new dataset to evaluate the model performance using rMSE. When you look at the rMSE, what do you find? If you think sw.fit works well on Regression_new.csv, please explain why it does well. Otherwise, if you think your model sw.fit doesn't perform well on Regression_new.csv, can you point out the potential reason(s) for this?
In [ ]: # ANSWER BLOCK
new <- read.csv(...)  # Reading in the new dataset
print(rmse(...))      # Finding out the rMSE of the sw.fit model with respect to the new dataset
ANSWER (TEXT)
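A minimal sketch of the evaluation step, reusing the rmse() helper defined earlier:

# Sketch only: evaluate the stepwise model on the new dataset
new <- read.csv("Regression_new.csv")
print(rmse(sw.fit, new))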
Question 6
Although stepwise regression has reduced the model complexity significantly, the model still contains a lot of variables that we want to remove. Therefore, you are interested in lightweight linear regression models with ONLY TWO predictors. Write a script to automatically find the best lightweight model which corresponds to the model with the least rMSE on the training dataset (Not the new dataset). Compare the rMSE of the best lightweight model with the rMSE of the full model - lm.fit - that you built previously. Give an explanation for these results based on consideration of the predictors involved.
In [ ]: # ANSWER BLOCK
# Some variables that you would want to initialize
minimum_error <- ...
features <- ...

# CODE HERE

print(paste("The best features are", features, "; and the rMSE is", minimum_error))
ANSWER (TEXT)
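One possible sketch of the search is below. It assumes an exhaustive loop over every pair of predictor columns (an assumption about the intended approach), fitting a two-predictor model for each pair and keeping the pair with the lowest training rMSE, reusing the rmse() helper from Question 3.

# Sketch only: exhaustive search over all pairs of predictors,
# keeping the pair whose two-predictor model has the lowest training rMSE
predictors <- setdiff(names(train), "SO2")
minimum_error <- Inf
features <- c()

for (i in 1:(length(predictors) - 1)) {
  for (j in (i + 1):length(predictors)) {
    pair <- c(predictors[i], predictors[j])
    fml <- as.formula(paste("SO2 ~", paste(pair, collapse = " + ")))
    fit <- lm(fml, data = train)
    err <- rmse(fit, train)
    if (err < minimum_error) {
      minimum_error <- err
      features <- pair
    }
  }
}

print(paste("The best features are", paste(features, collapse = ", "),
            "; and the rMSE is", minimum_error))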
Question 7
Rather than looking into rMSE, you want to build a lightweight linear regression model with ONLY TWO predictors which has the highest R squared. Write a script to automatically find the best lightweight model which corresponds to the model with the highest R squared on the training dataset (Not the new dataset).
Furthermore, please compare the two predictors in the best lightweight model found in the previous question with those found in this question. Are the two predictors in each case different? If they differ, please explain why.
In [ ]: # ANSWER BLOCK
# Some variables that you would want to initialize
maximum_rsquared <- ...
features <- ...

# CODE HERE

print(paste("The best features are", features, "; and the rSquared is", maximum_rsquared))
ANSWER (TEXT)
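The same exhaustive pair search can be adapted to maximise R squared instead; the sketch below assumes the R squared reported by summary() on the training fit is the selection criterion.

# Sketch only: exhaustive pair search keeping the pair with the highest training R squared
predictors <- setdiff(names(train), "SO2")
maximum_rsquared <- -Inf
features <- c()

for (i in 1:(length(predictors) - 1)) {
  for (j in (i + 1):length(predictors)) {
    pair <- c(predictors[i], predictors[j])
    fml <- as.formula(paste("SO2 ~", paste(pair, collapse = " + ")))
    r2 <- summary(lm(fml, data = train))$r.squared
    if (r2 > maximum_rsquared) {
      maximum_rsquared <- r2
      features <- pair
    }
  }
}

print(paste("The best features are", paste(features, collapse = ", "),
            "; and the rSquared is", maximum_rsquared))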
Question 8 (Libraries are allowed)
As a Data Scientist, one of the key tasks is to build models that are most appropriate/closest to the truth; thus, modelling will not be limited to the aforementioned steps in this assignment.
Additionally, you need to describe/document your thought process during this model-building process; this is akin to showing your working in the mathematics sections. If you don't clearly document the reasoning behind the model you use, we will have to deduct some marks.
Note: Please make sure that we can install the libraries that you use in this part; the code structure can be:
install.packages("some package", repos = "http://cran.us.r-project.org")
library("some package")
Remember that if we cannot run your code, we will have to give you a deduction. Our suggestion is for you to use the standard R version 3.6.1.
You also need to name your final model fin.mod so we can run a check to find out your performance. A good test for your understanding would be to set the previous BIC model to be the final model to check if your code works perfectly.
In [ ]: # Build your final model here, use additional coding blocks if you need to
fin.mod <- NULL
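As the note above suggests, a quick sanity check (a sketch, not a final answer) is to point fin.mod at the earlier BIC stepwise model and confirm that the prediction code below runs end to end:

# Sketch only: use the BIC stepwise model from Question 4 as a placeholder final model
fin.mod <- sw.fit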
In [ ]: # Load in the test data.
test <- read.csv("Regression_test.csv")

# If you are using any packages that perform the prediction differently, please change this line accordingly
pred.label <- predict(fin.mod, test)

# Put these predicted labels in a csv file that you can use to commit to the Kaggle competition
write.csv(data.frame("RowIndex" = seq(1, length(pred.label)), "Prediction" = pred.label),
          "RegressionPredictLabel.csv", row.names = F)
In [ ]: ## PLEASE DO NOT ALTER THIS CODE BLOCK
## Please skip (don't run) this if you are a student
## For teaching team use only
source("../supplimentary.R")
truths <- read.csv("../Regression_truths.csv")
RMSE.fin <- rmse(pred.label, truths$SO2)
cat(paste("RMSE is", RMSE.fin))