Data Analytics | Our Portfolio

1. Basic Statistics (5%)

(a) Exam Grades Analysis for Module A

The exam grades for 20 students in Module A are:

5, 31, 19, 21, 33, 8, 15, 43, 67, 50, 93, 28, 41, 41, 35, 10, 23, 77, 97, 63

(i) Mean:


module_a <- c(5, 31, 19, 21, 33, 8, 15, 43, 67, 50, 93, 28, 41, 41, 35, 10, 23, 77, 97, 63)
mean(module_a)
        
The mean score for Module A is 40.

(ii) Median:


median(module_a)
        
The median score for Module A is 34.

(iii) Mode:


# tabulate frequencies and keep every value tied for the highest count
table_result <- table(module_a)
modes <- as.numeric(names(table_result[table_result == max(table_result)]))
modes
        
The mode score for Module A is 41.

Measures of Spread

(i) Variance:


var(module_a)
        
The variance of Module A scores is 734.7368.

(ii) Standard Deviation:


sd(module_a)
        
The standard deviation of Module A scores is 27.10603.

(b) Exam Grades Analysis for Module B

The exam grades for 20 students in Module B are:

17, 40, 34, 19, 53, 87, 71, 42, 86, 61, 10, 22, 92, 52, 27, 12, 43, 84, 32, 41

Comparing Module A and B Scores

The Module B scores are entered as a second vector, paired element-wise with the Module A scores, so that covariance and correlation can be computed:


module_b <- c(17, 40, 34, 19, 53, 87, 71, 42, 86, 61, 10, 22, 92, 52, 27, 12, 43, 84, 32, 41)
        

(i) Covariance and Correlation:


cov(module_a, module_b)
cor(module_a, module_b)
        

The covariance between the Module A and B scores is approximately 60.89. The correlation is about 0.0854, indicating a very weak positive relationship between the two sets of scores.
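
As a sanity check, the correlation is simply the covariance rescaled by the two standard deviations:


cov(module_a, module_b) / (sd(module_a) * sd(module_b))  # equals cor(module_a, module_b)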

(ii) Correlation Type:

There is a very weak positive linear relationship between the exam grades of Module A and Module B.

2. Linear Regression (5%)

This section involves simple linear regression analysis using the Auto dataset.

(a) Linear Regression Analysis

Using the lm() function, a simple linear regression was performed with mpg as the response and acceleration as the predictor:


library(ISLR)   # provides the Auto dataset
model <- lm(mpg ~ acceleration, data = Auto)
summary(model)
        

i. Relationship Between Predictor and Response:

There is a significant relationship between acceleration and mpg as the p-value is less than 0.05.

ii. Strength of Relationship:

The R-squared value is 0.1792, indicating that acceleration explains only 17.92% of the variation in mpg. This suggests a weak predictive strength.

iii. Positive or Negative Relationship:

The relationship is positive, as indicated by the positive coefficient of acceleration (1.1976).

iv. Predicted mpg for Acceleration of 14.0:

The predicted mpg is approximately 21.61. The 95% confidence interval is (21.34, 21.85), and the 95% prediction interval is (7.69, 35.51).
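
These quantities can be read straight off the fitted model. The sketch below reuses the model object from part (a) and reproduces the p-value, R-squared, slope, and the two intervals:


s <- summary(model)
s$coefficients["acceleration", "Pr(>|t|)"]   # p-value for the slope (i)
s$r.squared                                  # R-squared (ii)
coef(model)["acceleration"]                  # sign of the slope (iii)

new_obs <- data.frame(acceleration = 14.0)
predict(model, newdata = new_obs, interval = "confidence", level = 0.95)   # (iv)
predict(model, newdata = new_obs, interval = "prediction", level = 0.95)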

(b) Scatter Plot and Regression Line


plot(mpg ~ acceleration, data = Auto, xlab = "Acceleration", ylab = "Miles per Gallon", main = "mpg vs. Acceleration")
abline(model, col = "red")
        

The scatter plot of mpg against acceleration, with the fitted regression line overlaid in red, visualizes the weak positive relationship quantified above.

3. Classification Tree (5%)

This section focuses on the OJ dataset, analyzing it with a classification tree.

(a) Train-Test Split


set.seed(1)
train_indices <- sample(1:nrow(OJ), 700, replace = FALSE)
train_data <- OJ[train_indices, ]
test_data <- OJ[-train_indices, ]
        
The training set consists of 700 observations; the test set contains the remaining 370 (the OJ dataset has 1,070 rows).

(b) Fitting the Tree Model


library(rpart)
tree_model <- rpart(Purchase ~ ., data = train_data, method = "class")
summary(tree_model)
        

The 39.29% figure quoted by summary() is the root node error, i.e. the misclassification rate of always predicting the majority class (275 of the 700 training observations fall in the minority class); the fitted tree's own training error is lower. The tree has 7 terminal nodes.
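
The tree's training error can be computed directly from its predictions on the training set:


train_pred <- predict(tree_model, newdata = train_data, type = "class")
mean(train_pred != train_data$Purchase)   # training misclassification rate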

(c) Plotting the Tree


library(rpart.plot)
rpart.plot(tree_model, box.palette = "Blues", shadow.col = "gray", nn = TRUE)
        

The decision tree provides insights into customer purchase behavior based on predictor variables.

(d) Test Data Predictions


test_predictions <- predict(tree_model, newdata = test_data, type = "class")
confusion_matrix <- table(Actual = test_data$Purchase, Predicted = test_predictions)
confusion_matrix
test_error_rate <- (confusion_matrix[1, 2] + confusion_matrix[2, 1]) / sum(confusion_matrix)
test_error_rate
        

The test error rate is approximately 19.18%, indicating the model's performance on unseen data.

(e) Optimal Tree Size


# cv.tree() is from the 'tree' package and does not accept rpart fits;
# rpart cross-validates internally (10-fold) and stores the results in the CP table
printcp(tree_model)
plotcp(tree_model)   # cross-validated error against tree complexity
        

The cross-validated error (xerror) in the CP table indicates the complexity, and hence the tree size, that minimizes prediction error.
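
The complexity parameter that minimizes the cross-validated error can be extracted from the CP table of the tree_model object above:


best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
best_cp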

(f) Pruned Tree with 5 Nodes


cp_five <- tree_model$cptable[tree_model$cptable[, "nsplit"] == 4, "CP"]  # CP for a 5-leaf subtree (4 splits)
pruned_tree <- prune(tree_model, cp = cp_five)  # rpart prunes by CP, not node count
        

Pruning at the complexity value associated with 4 splits yields a tree with 5 terminal nodes.
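
For a visual check, the pruned tree can be drawn with the same plotting call used in part (c):


rpart.plot(pruned_tree, box.palette = "Blues", shadow.col = "gray", nn = TRUE)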

(g) Training Error Rate Comparison

The pruned tree has a higher training error rate compared to the unpruned tree.
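
This can be verified by comparing misclassification rates on the training set, reusing the objects defined above:


mean(predict(pruned_tree, newdata = train_data, type = "class") != train_data$Purchase)
mean(predict(tree_model, newdata = train_data, type = "class") != train_data$Purchase)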

(h) Test Error Rate Comparison

The pruned tree has a lower test error rate, making it more generalizable to unseen data.
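
The same comparison can be run on the test set:


mean(predict(pruned_tree, newdata = test_data, type = "class") != test_data$Purchase)
mean(predict(tree_model, newdata = test_data, type = "class") != test_data$Purchase)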

4. Bayesian Networks and Naive Bayes Classifiers (5%)

(1) Conditional Probability Tables


library(bnlearn)   # model2network() and bn.fit()

training_data <- data.frame(
    Income = factor(c("High", "Low", "Low", "High", "Low", "High", "High", "Low", "Low", "Low", "High", "Low", 
                      "Low", "High", "Low", "High", "High", "Low", "Low", "Low", "High", "Low", "High", "Low", 
                      "High", "High", "Low", "Low", "Low", "High")),
    Student = factor(c("True", "False", "True", "False", "True", "False", "True", "True", "False", "True", 
                       "True", "False", "True", "False", "True", "False", "True", "True", "False", "True", 
                       "False", "True", "False", "True", "False", "True", "True", "False", "True", "True")),
    Credit_Rating = factor(c("Fair", "Excellent", "Fair", "Fair", "Excellent", "Fair", "Excellent", 
                             "Fair", "Excellent", "Excellent", "Fair", "Fair", "Fair", "Excellent", 
                             "Fair", "Excellent", "Excellent", "Fair", "Fair", "Excellent", "Excellent", 
                             "Excellent", "Excellent", "Excellent", "Fair", "Excellent", "Fair", "Fair", 
                             "Fair", "Excellent")),
    Buy_Computer = factor(c("Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", 
                            "No", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No", "Yes", 
                            "No", "Yes", "Yes", "No", "No", "No", "Yes", "Yes", "No", "No"))
)
bn <- model2network("[Income][Student][Credit_Rating|Student][Buy_Computer|Income:Credit_Rating]")
fit <- bn.fit(bn, training_data)
fit
        

The Bayesian network is fitted to the training data; printing the fit displays the conditional probability table for each node.
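
A single node's table can also be inspected on its own, for example the class node:


fit$Buy_Computer   # conditional probability table for Buy_Computer alone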

(2) Predictions for Testing Instances


testing_data <- data.frame(
    Income = factor(c("High", "Low", "Low", "High", "Low")),
    Student = factor(c("True", "False", "True", "False", "True")),
    Credit_Rating = factor(c("Fair", "Excellent", "Fair", "Fair", "Excellent"))
)
pred <- predict(fit, data = testing_data, node = "Buy_Computer")
pred
        

Predictions for the test data instances are made using the Bayesian network classifier.
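
For readability, the predicted class can be attached to the test instances:


cbind(testing_data, Buy_Computer = pred)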

About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.