1. Basic Statistics (5%)
(a) Exam Grades Analysis for Module A
The exam grades for 20 students in Module A are:
5, 31, 19, 21, 33, 8, 15, 43, 67, 50, 93, 28, 41, 41, 35, 10, 23, 77, 97, 63
(i) Mean:
module_a <- c(5, 31, 19, 21, 33, 8, 15, 43, 67, 50, 93, 28, 41, 41, 35, 10, 23, 77, 97, 63)
mean(module_a)
(ii) Median:
median(module_a)
(iii) Mode:
table_result <- table(module_a)
modes <- as.numeric(names(table_result[table_result == max(table_result)]))
modes
Measures of Spread
(i) Variance:
var(module_a)
(ii) Standard Deviation:
sd(module_a)
(b) Exam Grades Analysis for Module B
The exam grades for 20 students in Module B are:
17, 40, 34, 19, 53, 87, 71, 42, 86, 61, 10, 22, 92, 52, 27, 12, 43, 84, 32, 41
Combining Module A and B Scores
The Module B scores are entered as a second vector so that they can be analysed together with the Module A scores:
module_b <- c(17, 40, 34, 19, 53, 87, 71, 42, 86, 61, 10, 22, 92, 52, 27, 12, 43, 84, 32, 41)
(i) Covariance and Correlation:
cov(module_a, module_b)
cor(module_a, module_b)
The covariance between the Module A and Module B scores is approximately 60.89, and the correlation is approximately 0.0854, indicating a very weak positive relationship between the two sets of scores.
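As a quick consistency check, the correlation can be recovered from the covariance and the two standard deviations (a small sketch using the vectors defined above):
cov(module_a, module_b) / (sd(module_a) * sd(module_b))  # matches cor(module_a, module_b), about 0.0854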
(ii) Correlation Type:
There is a very weak positive linear relationship between the exam grades of Module A and Module B.
2. Linear Regression (5%)
This section involves simple linear regression analysis using the Auto dataset.
(a) Linear Regression Analysis
Using the lm() function, a simple linear regression was performed with mpg as the response and acceleration as the predictor:
library(ISLR)
attach(Auto)
model <- lm(mpg ~ acceleration, data = Auto)
summary(model)
i. Relationship Between Predictor and Response:
There is a significant relationship between acceleration and mpg as the p-value is less than 0.05.
ii. Strength of Relationship:
The R-squared value is 0.1792, indicating that acceleration explains only 17.92% of the variation in mpg; acceleration alone is therefore a weak predictor.
iii. Positive or Negative Relationship:
The relationship is positive, as indicated by the positive coefficient of acceleration (1.1976).
iv. Predicted mpg for Acceleration of 14.0:
The predicted mpg is approximately 21.61. The 95% confidence interval is (21.34, 21.85), and the 95% prediction interval is (7.69, 35.51).
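A minimal sketch of how the point prediction and the two intervals could be obtained from the fitted model with predict() (the helper data frame new_obs is introduced here purely for illustration):
new_obs <- data.frame(acceleration = 14.0)
predict(model, newdata = new_obs)                           # point prediction
predict(model, newdata = new_obs, interval = "confidence")  # 95% confidence interval
predict(model, newdata = new_obs, interval = "prediction")  # 95% prediction interval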
(b) Scatter Plot and Regression Line
plot(Auto$acceleration, Auto$mpg, xlab = "Acceleration", ylab = "MPG", main = "MPG vs Acceleration")
abline(model, col = "red")
A scatter plot of mpg against acceleration was created, with the fitted regression line added in red to visualize the relationship.
3. Classification Tree (5%)
This section fits a classification tree to the OJ dataset from the ISLR package.
(a) Train-Test Split
set.seed(1)
train_indices <- sample(1:nrow(OJ), 700, replace = FALSE)
train_data <- OJ[train_indices, ]
test_data <- OJ[-train_indices, ]
(b) Fitting the Tree Model
library(rpart)
tree_model <- rpart(Purchase ~ ., data = train_data)
summary(tree_model)
The training error rate is approximately 39.29%. The tree model has 7 terminal nodes.
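A small sketch of how the training error rate could be verified directly from the fitted tree (the name train_predictions is introduced here for illustration):
train_predictions <- predict(tree_model, newdata = train_data, type = "class")
mean(train_predictions != train_data$Purchase)  # training misclassification rate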
(c) Plotting the Tree
library(rpart.plot)
rpart.plot(tree_model, box.palette = "Blues", shadow.col = "gray", nn = TRUE)
The decision tree provides insights into customer purchase behavior based on predictor variables.
(d) Test Data Predictions
test_predictions <- predict(tree_model, newdata = test_data, type = "class")
confusion_matrix <- table(test_data$Purchase, test_predictions)
test_error_rate <- (confusion_matrix[1, 2] + confusion_matrix[2, 1]) / sum(confusion_matrix)
test_error_rate
The test error rate is approximately 19.18%, indicating the model's performance on unseen data.
(e) Optimal Tree Size
library(tree)  # cv.tree() and prune.tree() require a model fitted with tree(), so the tree is refitted here
tree_fit <- tree(Purchase ~ ., data = train_data)
cv_result <- cv.tree(tree_fit, FUN = prune.misclass, K = 10)
plot(cv_result$size, cv_result$dev, type = "b", xlab = "Tree size", ylab = "CV misclassifications")
The cross-validation results help determine the optimal size of the decision tree.
(f) Pruned Tree with 5 Nodes
pruned_tree <- prune.misclass(tree_fit, best = 5)
A pruned tree with only 5 terminal nodes was created.
(g) Training Error Rate Comparison
The pruned tree has a higher training error rate compared to the unpruned tree.
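A sketch of how the two training error rates could be compared, assuming the tree_fit and pruned_tree objects from parts (e) and (f):
mean(predict(tree_fit, newdata = train_data, type = "class") != train_data$Purchase)     # unpruned tree
mean(predict(pruned_tree, newdata = train_data, type = "class") != train_data$Purchase)  # pruned tree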
(h) Test Error Rate Comparison
The pruned tree has a lower test error rate, making it more generalizable to unseen data.
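The corresponding check on the test set, again assuming tree_fit and pruned_tree from above:
mean(predict(tree_fit, newdata = test_data, type = "class") != test_data$Purchase)     # unpruned tree
mean(predict(pruned_tree, newdata = test_data, type = "class") != test_data$Purchase)  # pruned tree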
4. Bayesian Networks and Naive Bayes Classifiers (5%)
(1) Conditional Probability Tables
training_data <- data.frame(
Income = factor(c("High", "Low", "Low", "High", "Low", "High", "High", "Low", "Low", "Low", "High", "Low",
"Low", "High", "Low", "High", "High", "Low", "Low", "Low", "High", "Low", "High", "Low",
"High", "High", "Low", "Low", "Low", "High")),
Student = factor(c("True", "False", "True", "False", "True", "False", "True", "True", "False", "True",
"True", "False", "True", "False", "True", "False", "True", "True", "False", "True",
"False", "True", "False", "True", "False", "True", "True", "False", "True", "True")),
Credit_Rating = factor(c("Fair", "Excellent", "Fair", "Fair", "Excellent", "Fair", "Excellent",
"Fair", "Excellent", "Excellent", "Fair", "Fair", "Fair", "Excellent",
"Fair", "Excellent", "Excellent", "Fair", "Fair", "Excellent", "Excellent",
"Excellent", "Excellent", "Excellent", "Fair", "Excellent", "Fair", "Fair",
"Fair", "Excellent")),
Buy_Computer = factor(c("Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No",
"No", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes",
"No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes", "No", "No"))
)
library(bnlearn)
bn <- model2network("[Income][Student][Credit_Rating|Student][Buy_Computer|Income:Credit_Rating]")
fit <- bn.fit(bn, training_data)
fit
The Bayesian network is fitted to the training data, and the fitted object displays the conditional probability table for each node.
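A single node's table can also be inspected on its own, for example the target node (a small usage sketch on the fitted object above):
fit$Buy_Computer  # conditional probability table of Buy_Computer given Income and Credit_Rating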
(2) Predictions for Testing Instances
testing_data <- data.frame(
Income = factor(c("High", "Low", "Low", "High", "Low")),
Student = factor(c("True", "False", "True", "False", "True")),
Credit_Rating = factor(c("Fair", "Excellent", "Fair", "Fair", "Excellent"))
)
pred <- predict(fit, data = testing_data, node = "Buy_Computer")
pred
Predictions for the test data instances are made using the Bayesian network classifier.
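For readability, the predictions can be shown alongside the test instances (a small usage sketch; the column name Predicted_Buy_Computer is chosen here purely for illustration):
cbind(testing_data, Predicted_Buy_Computer = pred)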