STM4PSD (Semester 2) ASSIGNMENT 1, 2021
Due by 5pm on Friday 13 August, 2021.
1. Submit your assignment as a single scanned PDF file through the LMS before the due time. By submitting your work electronically, you are affirming that it is all your own work, and you will be asked to confirm this as you submit it.
2. You may use facts from the reading materials and from the lab classes to answer these questions.
3. Working must be shown to support your answers. It is not only the final answer that is important, but your mastery of the required techniques, and the way you communicate your ideas and your approach to the problems. You will be assessed on the way you communicate your answers.
4. There are 20 total marks available for this assignment. There will be 14 marks awarded for correct solutions, 3 marks for written communication, and 3 completeness marks for making a reasonable attempt at all questions.
5. Late submissions for non-emergency reasons will not be accepted unless you have made an arrangement with the subject coordinator before the due date.
6. In accordance with department policy, students may be asked by the subject coordinator to verbally explain or demonstrate their answers.
Question 1. A random variable X has the following probability mass function:
x           1       3       6       7
P(X = x)    0.121   0.606   0.161   0.112
For this question, use 3 decimal places of accuracy while performing all calculations, and give all final answers to 3 decimal places.
(a) Write down the set Ω_X.
(b) Determine E(X), Var(X) and SD(X).
(c) Let A denote the event “X = 6”, let B denote the event “X = 3”, and let C denote the event “X = 3”.
Determine each of the following:
(i) P(A), P(B) and P(C)
(ii) P(A ∩ B) and P(A ∩ C)
(iii) P(B | A) and P(C | A)
Question 2. The skewness of a random variable X is defined as:
E[((X − µ) / σ)^3],
where µ = E(X) and σ = SD(X). It is used to measure the level of asymmetry in a probability distribution.
(a) Write code for an R function skewness which takes two variables as input, events and probabilities, and returns the skewness (as per the above definition) based on these values. The variable events is assumed to be a vector containing the events in the sample space of a discrete random variable X, and the variable probabilities is assumed to be a vector containing the respective probability of each event.
You should make use of the expected.value and variance functions from the Week 2 Lab class in your code. You may also find it helpful to model your solution on those functions.
(b) Use your function to verify that the skewness of the random variable X in Question 1 is approximately 0.598.
For this question, submit the code you write for part (a), and the commands used to evaluate part (b).
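As an illustration of the calling convention described in part (a), the sketch below shows how functions taking events and probabilities vectors might look. The bodies of expected.value and variance here are assumptions written for illustration only; the Week 2 Lab versions are not reproduced in this document and may differ. The example uses a fair six-sided die, not the distribution from Question 1.

```r
# Hypothetical stand-ins for the Week 2 Lab helpers (assumed, not the lab code).
expected.value <- function(events, probabilities) {
  # E(X) = sum over all x of x * P(X = x)
  sum(events * probabilities)
}

variance <- function(events, probabilities) {
  # Var(X) = E[(X - mu)^2], computed from the definition
  mu <- expected.value(events, probabilities)
  sum((events - mu)^2 * probabilities)
}

# Toy distribution: a fair six-sided die.
die.events <- 1:6
die.probabilities <- rep(1/6, 6)

die.mean <- expected.value(die.events, die.probabilities)
die.var <- variance(die.events, die.probabilities)

print(die.mean)
print(die.var)
```

Both helpers take the same two vectors in the same order, which is the pattern a skewness function with the signature above would follow.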
Question 3. Suppose that you have implemented a machine learning model to detect and filter spam on your email server. Historical data shows that 20% of all incoming email to your server is spam. After using test data on your model, you estimate that it correctly identifies 98% of spam emails as spam, but also incorrectly identifies 3% of legitimate emails as spam. Let S denote the event that an email is actually spam, and let T denote the event that it is identified as spam by the machine learning model. Assume that a positive instance refers to a spam email.
For this question, use 3 decimal places of accuracy while performing all calculations, and give all final answers to 3 decimal places.
(a) Based on the information described above, state the probabilities P(S), P(T | S) and P(T | S^c).
(b) According to the reading materials, the probability P(T | S^c) is called the false positive rate. What name is given to the probability P(T | S)?
(c) Use the Law of Total Probability to calculate the probability that an email is identified as spam.
(d) Use Bayes’ Theorem to calculate P(S | T) and P(S | T^c). According to the reading materials, what names are given to these probabilities?
(e) Determine P(S^c | T) and P(S^c | T^c).
(f) Suppose that you have used your model to classify some emails, and it has identified 100 of them as spam.
(i) Let Y denote the number of these emails which are actually legitimate. Based on your calculations above, what probability distribution does Y follow? Give your answer in the form Y ~ .
(ii) Based on the estimated false positive rate of the model, a colleague tells you to expect that 3 of these emails are actually legitimate. Is your colleague correct? If so, explain. If not, give a correct description.
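For reference, the following sketch shows the Law of Total Probability and Bayes’ Theorem applied in R to a generic pair of events H and E. The three input probabilities are made-up illustration values, not the figures from this question, and the variable names are hypothetical.

```r
# Made-up inputs for a generic hypothesis H and evidence E (illustration only).
p.H <- 0.10              # P(H)
p.E.given.H <- 0.90      # P(E | H)
p.E.given.notH <- 0.05   # P(E | H^c)

# Law of Total Probability: P(E) = P(E | H) P(H) + P(E | H^c) P(H^c)
p.E <- p.E.given.H * p.H + p.E.given.notH * (1 - p.H)

# Bayes' Theorem: P(H | E) = P(E | H) P(H) / P(E)
p.H.given.E <- p.E.given.H * p.H / p.E

print(round(p.E, 3))          # 0.135
print(round(p.H.given.E, 3))  # 0.667
```

Even with P(E | H) = 0.90, the posterior P(H | E) is only about 0.667 here, because H is rare to begin with.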
Question 4. Consider the following transaction database:
Transaction ID Items
1 {item1, item2, item3, item4, item5}
2 {item2, item4}
3 {item1, item2, item3, item5}
4 {item2}
5 {item2, item3, item4, item5}
6 {item1}
7 {item1, item2, item4, item5}
8 {item1, item3, item5}
9 {item1, item2, item3, item5}
10 {item2, item3}
11 {item1, item2, item3, item5}
For this question, give exact answers in all parts. Simplify any fractions as much as possible.
(a) Determine all one-item and two-item item-sets with minimum support . Show all steps of your working.
(b) Use your answer to (a) to determine all three-item item-sets with minimum support .
(c) Consider the following two association rules:
R1: {item1, item2} → {item3}
R2: {item3, item5} → {item2}
(i) Calculate the support, confidence and lift of each of these two association rules.
(ii) On the basis of the given data, which of the two association rules would be a better predictor of cross-sales? Explain your reasoning, referring to the calculations in part (i).
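The sketch below illustrates how support, confidence and lift can be computed in R for a rule on a small made-up database of four transactions (not the database above). It assumes the common definitions: support as the fraction of transactions containing an item-set, confidence as support(lhs ∪ rhs) / support(lhs), and lift as confidence / support(rhs). The reading materials may state these differently, and the function names and toy.db are hypothetical.

```r
# Toy database of four transactions (illustration only).
toy.db <- list(c("a", "b"), c("a", "c"), c("a", "b", "c"), c("b"))

# Fraction of transactions that contain every item in `items`.
support <- function(items, db) {
  mean(sapply(db, function(transaction) all(items %in% transaction)))
}

# Confidence of the rule lhs -> rhs.
confidence <- function(lhs, rhs, db) {
  support(c(lhs, rhs), db) / support(lhs, db)
}

# Lift of the rule lhs -> rhs.
lift <- function(lhs, rhs, db) {
  confidence(lhs, rhs, db) / support(rhs, db)
}

rule.support <- support(c("a", "b"), toy.db)   # 2/4 = 1/2
rule.confidence <- confidence("a", "b", toy.db) # (1/2) / (3/4) = 2/3
rule.lift <- lift("a", "b", toy.db)             # (2/3) / (3/4) = 8/9

print(c(rule.support, rule.confidence, rule.lift))
```

For the toy rule {a} → {b} the lift is 8/9, which is below 1, so under these definitions a transaction containing a is slightly less likely than average to contain b.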