CS699 – Spring 2025 Project Report
1 Cover Page
This report summarizes the work conducted for the CS699 project during the Spring 2025 semester. Our project focused on building and evaluating classification models using a dataset from the 2023 American Community Survey.
The primary goal was to explore various classification techniques and evaluate their performance in predicting whether individuals have difficulty living independently.
The dataset contained 4318 entries and 117 variables, with the class attribute indicating whether individuals face challenges in living independently.
We implemented several models, analyzed their strengths and weaknesses, and describe the data preprocessing, class-balancing techniques, and model evaluation in the sections that follow.
2 Data Mining Tools
We used R for all data mining and modeling work, supported by several key packages:
- tidyverse: For data manipulation and visualization.
- caret: For preprocessing, feature selection, and model training.
- e1071: For algorithms like SVM.
- randomForest: For Random Forest classification.
- nnet: For training neural networks.
- rpart: For building decision trees.
- pROC: For ROC and AUC evaluation.
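These packages can be loaded at the top of the analysis script, for example:

```r
# Load the packages used throughout the project
# (install.packages(...) first for any that are not yet available)
library(tidyverse)
library(caret)
library(e1071)
library(randomForest)
library(nnet)
library(rpart)
library(pROC)
```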
3 Classification Algorithms Used
In our analysis, we implemented several classification algorithms to predict the target variable, which indicates whether a person has difficulty living independently. The models we evaluated include:
- Logistic Regression - This model estimates the probability of a binary outcome based on one or more predictor variables. It is a widely used method for binary classification tasks.
- Decision Trees - A decision tree uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. Decision trees are easy to interpret and visualize.
- Random Forest - This algorithm builds multiple decision trees and merges them together to get a more accurate and stable prediction. It is known for its robustness and ability to handle large datasets.
- Neural Networks - These models are inspired by the human brain and consist of layers of interconnected nodes, allowing for complex relationships in the data to be learned. They are particularly effective for large datasets with non-linear relationships.
- Support Vector Machines (SVM) - This algorithm finds the hyperplane that best separates different classes in the feature space. SVMs are effective in high-dimensional spaces and are commonly used for classification tasks.
- k-Nearest Neighbors (k-NN) - This method classifies instances based on the classes of their nearest neighbors in the feature space. It is simple and effective, especially for smaller datasets.
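The models above can all be fit through caret's unified `train()` interface. The following is a minimal sketch, assuming a preprocessed data frame `train_data` with a factor column `Class` (names are ours, from the steps in Section 4); note that caret's `svmRadial` method is backed by the kernlab package rather than e1071.

```r
library(caret)

# caret method codes for the algorithms discussed above
methods <- c(glm = "glm", rpart = "rpart", rf = "rf",
             nnet = "nnet", svm = "svmRadial", knn = "knn")

# Fit each model with 5-fold cross-validation on the training set
models <- lapply(methods, function(m) {
  train(Class ~ ., data = train_data, method = m,
        trControl = trainControl(method = "cv", number = 5))
})
```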
4 Data Mining Procedure
4.1 Step 1: Dataset Preparation
We began by loading our dataset, which was part of the 2023 American Community Survey. The dataset contained 4318 entries with 117 variables, with the class attribute indicating whether individuals had difficulties living independently.
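Loading the extract might look like the following sketch; the file name here is hypothetical and should be replaced with the actual ACS extract used.

```r
# Hypothetical file name; substitute the path to the actual ACS extract
acs <- read.csv("project_data.csv")
dim(acs)  # expected: 4318 rows and 117 columns, per the description above
```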
4.2 Step 2: Data Cleaning
Next, we performed data cleaning by removing any columns that had more than 50% missing values. This step was crucial to ensure that our analysis was not skewed by incomplete data. For the remaining missing values, we applied a simple imputation method, filling them with zeros.
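The two cleaning rules described above can be expressed compactly in base R; this sketch assumes the data frame `acs` from the loading step.

```r
# Drop columns with more than 50% missing values
na_frac <- colMeans(is.na(acs))
acs <- acs[, na_frac <= 0.5]

# Simple imputation: replace the remaining NAs with zero
acs[is.na(acs)] <- 0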
4.3 Step 3: Variable Transformation
The target variable “Class” was converted into a factor to ensure it was treated correctly by the classification algorithms. This transformation is essential for accurate model training and evaluation.
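The conversion is a one-liner; the level labels here are assumed to match those in the confusion matrices of Section 5.

```r
# Treat Class as a categorical outcome; labels assumed to be "No"/"Yes"
acs$Class <- factor(acs$Class, levels = c("No", "Yes"))
```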
4.4 Step 4: Feature Scaling
To standardize our numeric features, we applied scaling techniques to ensure that all features contributed equally to the model training process. This step helps improve model performance, especially for algorithms sensitive to the scale of input features.
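One way to standardize only the numeric predictors, leaving the factor target untouched, is:

```r
# Standardize numeric predictors to mean 0, sd 1
# (Class is a factor at this point, so it is excluded automatically)
num_cols <- sapply(acs, is.numeric)
acs[num_cols] <- scale(acs[num_cols])
```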
4.5 Step 5: Train-Test Split
We split the dataset into training and testing sets, using 80% of the data for training the models and the remaining 20% for testing their performance. This split is crucial for evaluating how well our models can generalize to unseen data.
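A stratified 80/20 split can be produced with caret's `createDataPartition()`, which preserves the class proportions in both sets; the seed value below is arbitrary.

```r
library(caret)

set.seed(699)  # arbitrary seed, for reproducibility
idx <- createDataPartition(acs$Class, p = 0.8, list = FALSE)
train_data <- acs[idx, ]
test_data  <- acs[-idx, ]
```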
4.6 Step 6: Balancing Techniques
To address the issue of class imbalance in our dataset, we implemented two sampling techniques: undersampling and oversampling. The undersampling method reduced the number of instances in the majority class, while the oversampling method increased the number of instances in the minority class by randomly sampling with replacement. This balancing step is vital for building robust models that can accurately predict both classes.
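caret provides `downSample()` and `upSample()`, which implement exactly these two schemes: the former discards majority-class rows, the latter resamples minority-class rows with replacement.

```r
library(caret)

predictors <- setdiff(names(train_data), "Class")

# Undersample the majority class
under_train <- downSample(x = train_data[, predictors],
                          y = train_data$Class, yname = "Class")

# Oversample the minority class (sampling with replacement)
over_train <- upSample(x = train_data[, predictors],
                       y = train_data$Class, yname = "Class")

# Both should now show equal counts per class
table(under_train$Class)
table(over_train$Class)
```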
4.7 Step 7: Model Training & Evaluation
We trained multiple classification models on both the undersampled and oversampled datasets. A custom evaluation function was used to calculate performance metrics, including confusion matrices, accuracy, sensitivity, specificity, precision, recall, F1 score, and ROC AUC. These metrics provided a comprehensive view of each model’s performance.
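The evaluation helper might look like the sketch below. It assumes a caret-fitted model (so that `predict(..., type = "prob")` is available) and the `test_data` frame from the split step; the function name and the choice of "Yes" as the positive class are ours.

```r
library(caret)
library(pROC)

evaluate_model <- function(model, test_data, positive = "Yes") {
  pred  <- predict(model, newdata = test_data)
  probs <- predict(model, newdata = test_data, type = "prob")[, positive]

  cm  <- confusionMatrix(pred, test_data$Class, positive = positive)
  auc <- as.numeric(auc(roc(test_data$Class, probs, quiet = TRUE)))

  c(Accuracy    = unname(cm$overall["Accuracy"]),
    Sensitivity = unname(cm$byClass["Sensitivity"]),
    Specificity = unname(cm$byClass["Specificity"]),
    Precision   = unname(cm$byClass["Precision"]),
    Recall      = unname(cm$byClass["Recall"]),
    F1          = unname(cm$byClass["F1"]),
    ROC_AUC     = auc)
}
```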
5 Data Mining Results and Evaluation
5.1 Logistic Regression
Confusion Matrix
| | No | Yes |
|---|---|---|
| No | 789 | 44 |
| Yes | 11 | 19 |
Metrics
- Accuracy: 0.9546
- Sensitivity: 0.6333
- Specificity: 0.9798
- Precision: 0.9597
- Recall: 0.6333
- F1 Score: 0.7657
- ROC AUC: 0.7566
5.2 Decision Tree
Confusion Matrix
| | No | Yes |
|---|---|---|
| No | 800 | 63 |
| Yes | 0 | 0 |
Metrics
- Accuracy: 0.9270
- Sensitivity: 1.0000
- Specificity: 0.0000
- Precision: 0.9270
- F1 Score: 0.9621
- ROC AUC: 0.5000
5.3 Random Forest
Confusion Matrix
| | No | Yes |
|---|---|---|
| No | 792 | 47 |
| Yes | 7 | 16 |
Metrics
- Accuracy: 0.9363
- Sensitivity: 0.9805
- Specificity: 0.2492
- F1 Score: 0.9781
- ROC AUC: 0.6641
5.4 Neural Network
Confusion Matrix
| | No | Yes |
|---|---|---|
| No | 800 | 63 |
| Yes | 0 | 0 |
Metrics
- Accuracy: 0.9270
- Sensitivity: 1.0000
- Specificity: 0.0000
- F1 Score: 0.9621
- ROC AUC: 0.5000
6 Model Performance Comparison
| Model | Sensitivity | Specificity | Precision | F1 Score | ROC AUC |
|---|---|---|---|---|---|
| GLM (Under) | 0.7788 | 0.8571 | 0.9858 | 0.8701 | 0.8180 |
| GLM (Over) | 0.8238 | 0.8095 | 0.9821 | 0.8960 | 0.8166 |
| rpart (Under) | 1.0000 | 0.0000 | 0.9270 | 0.9621 | 0.5000 |
| rpart (Over) | 1.0000 | 0.0000 | 0.9270 | 0.9621 | 0.5000 |
| RF (Under) | 0.7875 | 0.8254 | 0.9828 | 0.8744 | 0.8064 |
| RF (Over) | 0.9825 | 0.3492 | 0.9504 | 0.9662 | 0.6659 |
| nnet (Under) | 0.6700 | 0.8254 | 0.9799 | 0.7958 | 0.7477 |
| nnet (Over) | 0.8175 | 0.7937 | 0.9805 | 0.8916 | 0.8056 |
7 Discussion
The results reveal notable differences in performance between the undersampled and oversampled datasets. For instance, the GLM model performed better with oversampling, achieving higher sensitivity and F1 score, indicating an improved ability to identify individuals who have difficulty living independently. Conversely, the decision tree showed zero specificity under both sampling methods, misclassifying every minority-class ("Yes") instance, and the neural network likewise collapsed to the majority class on the unbalanced data, although balancing improved it considerably.

The confusion matrices let us see not only overall accuracy but also how each model performs on each class. The F1 score balances precision and recall, which is particularly important for imbalanced datasets, and the ROC AUC quantifies each model's ability to discriminate between classes, with higher values indicating better performance.
8 ROC Curve Comparison
ROC (Receiver Operating Characteristic) curves plot the true positive rate (sensitivity) against the false positive rate (1 − specificity) across classification thresholds, visualizing the trade-off between sensitivity and specificity for each model. In our analysis, the logistic regression and random forest models produced the strongest ROC curves, balancing sensitivity and specificity well: they identify a large proportion of individuals who have difficulty living independently while keeping false positives low. The decision tree, by contrast, produced a diagonal ROC curve (AUC = 0.5), reflecting its collapse to the majority class, and the neural network was similarly weak on the unbalanced data.

Overall, the ROC analysis reinforces the findings from the confusion matrices and performance metrics. Comparing the areas under the curves (AUC), the logistic regression models achieved the highest values, with the undersampled random forest close behind. These results underscore the importance of model selection based on ROC analysis, which provides insight into performance beyond simple accuracy metrics.
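Overlaying the curves makes the comparison concrete. This sketch uses pROC and assumes vectors of predicted "Yes" probabilities on the test set (`prob_rf`, `prob_glm` are hypothetical names from our evaluation step).

```r
library(pROC)

# Build ROC objects from the true labels and predicted probabilities
roc_rf  <- roc(test_data$Class, prob_rf,  quiet = TRUE)
roc_glm <- roc(test_data$Class, prob_glm, quiet = TRUE)

# Plot both curves on conventional (1 - specificity) axes
plot(roc_rf, col = "darkgreen", legacy.axes = TRUE)
lines(roc_glm, col = "steelblue")
legend("bottomright",
       legend = c(sprintf("RF  (AUC = %.3f)", auc(roc_rf)),
                  sprintf("GLM (AUC = %.3f)", auc(roc_glm))),
       col = c("darkgreen", "steelblue"), lwd = 2)
```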
9 Conclusion
In conclusion, we successfully built and evaluated multiple classification models to predict whether individuals have difficulty living independently, using a dataset from the 2023 American Community Survey. Throughout the project, we employed R as our primary tool, applied various preprocessing techniques, and implemented sampling methods to balance our datasets. The results showed varied performance across models, highlighting the importance of selecting a model based on the metrics that matter for the task: logistic regression and random forest performed relatively well, while decision trees and neural networks struggled with class imbalance, producing misleading headline metrics. This project provided valuable insights into the complexities of machine learning and the significance of data preprocessing in achieving high-performing models.

Conclusion on Model Performance
In summary, the models produced varied results, showcasing the strengths and weaknesses of each approach. The GLM and Random Forest models generally exhibited higher sensitivity in the oversampled datasets, indicating their effectiveness in identifying individuals who have difficulty living independently. The Decision Tree and Neural Network models, however, faced challenges with sensitivity and specificity in certain datasets. The visualizations provided further insight, allowing us to make informed decisions about the most suitable models for our analysis.

Next Steps
Given the performance metrics and visual comparisons, future work may focus on refining the models, incorporating more sophisticated feature selection methods, exploring advanced tuning techniques, and investigating additional algorithms that may further improve prediction accuracy.
This project has provided a solid foundation for understanding the intricacies of classification modeling using real-world datasets.
10 Division of Work
Kevin Adrian Lopez Sepulveda: Responsible for data preprocessing, implementing sampling methods, training models, and evaluating their performance.
Dirgha Pareshkumar Jivani: Assisted in data exploration, implementation of algorithms.
11 References
- U.S. Census Bureau. (2023). American Community Survey. Retrieved from https://www.census.gov/programs-surveys/acs
12 Appendix
The full ZIP file accompanying this report contains the R scripts and the dataset used in this project.