The assignment has two parts:
- The first part is to implement the algorithms named in the tasks (Decision Tree, Gradient Boosted Tree and Linear Regression) and apply them to the provided bike sharing dataset. Try to reproduce the output given in the task sections (also given in the Big-Data Assignment.docx provided on Blackboard).
- The second part is to apply the algorithms created in the first part to another dataset chosen from Kaggle (other than the provided bike sharing dataset).
The queries are given in Big-Data Assignment.docx and the output is given in Big Data Assignment Marking Criteria.docx.
Coding should be done in Python (PySpark/Spark).
Datasets (marks per task)

Task                                    Bike sharing    Student-selected dataset
                                        [provided]      [from Kaggle.com]

Decision Tree
  Decision Tree                         5               5
  Decision Tree Categorical features    5               5
  Decision Tree Log                     5               5
  Decision Tree Max Bins                5               5
  Decision Tree Max Depth               5               5

Gradient Boosted Tree
  Gradient Boosted Tree                 5               5
  Gradient boost tree iterations        5               5
  Gradient boost tree Max Bins          5               5

Linear regression
  Linear regression                     5               5
  Linear regression Cross Validation
    Intercept                           5               5
    Iterations                          5               5
    Step size                           5               5
    L1 Regularization                   5               5
    L2 Regularization                   5               5
  Linear regression Log                 5               5

Subtotal                                75              75
Total mark                              150
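
As a starting point for the tasks listed above, the following is a minimal sketch of how the three regressors could be constructed with the DataFrame-based `pyspark.ml` API, and how a log-transformed target could be handled. Several things here are assumptions rather than part of the assignment: the short names `"dt"`, `"gbt"` and `"lr"`, the helper names `build_regressor`, `log_target` and `unlog_prediction`, the column names `label` and `features`, and the reading of the "Log" tasks as training on a log-transformed target. The assignment documents may expect a different API or a different approach.

```python
import math


def log_target(y):
    """Map a non-negative target to log space; log1p keeps y = 0 finite."""
    return math.log1p(y)


def unlog_prediction(p):
    """Invert log_target so predictions are back on the original scale."""
    return math.expm1(p)


def build_regressor(kind, label_col="label", features_col="features", **params):
    """Return an unfitted pyspark.ml regressor for one experiment.

    kind is a hypothetical short name: "dt", "gbt" or "lr".
    params are passed straight through to the estimator, for example
    maxDepth / maxBins for the trees, maxIter for GBT, and
    maxIter / regParam / elasticNetParam / fitIntercept for linear regression.
    """
    if kind not in ("dt", "gbt", "lr"):
        raise ValueError(f"unknown regressor kind: {kind!r}")
    # Imported lazily so the pure-Python helpers above work without Spark.
    from pyspark.ml.regression import (
        DecisionTreeRegressor,
        GBTRegressor,
        LinearRegression,
    )
    estimators = {
        "dt": DecisionTreeRegressor,
        "gbt": GBTRegressor,
        "lr": LinearRegression,
    }
    return estimators[kind](labelCol=label_col, featuresCol=features_col, **params)
```

With an active SparkSession, a call such as `build_regressor("dt", maxDepth=5, maxBins=32).fit(train_df)` would train one tree, where `train_df` is a hypothetical DataFrame with an assembled `features` vector column. In `pyspark.ml`, `elasticNetParam=1.0` selects L1 and `elasticNetParam=0.0` selects L2 regularization, with `regParam` as the strength. Note that the DataFrame-based `LinearRegression` does not expose a step-size parameter, so the "Step size" task may need a different API than the one sketched here.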