Arvee Jane Montera


Data Analyst and Data Scientist

I am a statistician with a strong foundation in data analysis, statistical modeling, and academic research. I am proficient in SPSS, R, and Python, with a keen eye for detail and a commitment to producing accurate, insightful results, and I am eager to contribute to data-driven projects and grow in a professional analytics environment.


Certifications

Education

University of Southeastern Philippines, Bachelor of Science in Statistics, Aug 2021 - June 2025


Projects

This section showcases the data analytics projects I created during my college years and more recently.


Python

Title: A Machine Learning Approach to Weather Prediction in Davao City, Philippines: Integrating K-Means Clustering and K-Nearest Neighbors (k-NN)

Objectives:

  1. To develop a k-NN model that accurately predicts relative humidity based on input variables such as visibility, temperature, pressure, and wind speed using the entire dataset.
  2. To create groups of data points with similar characteristics using K-means clustering.
  3. To build k-NN models for each cluster created by K-means clustering.
  4. To compare the predictive performance of the optimal k-NN model and the cluster-based k-NN models to identify the best-performing approach.

Conclusion: The k-NN model applied to the full dataset outperformed the cluster-based approach in predicting relative humidity, suggesting that clustering did not enhance model performance in this case (a short illustrative sketch of the two approaches appears below).

Code for K-means Clustering integrated with K-Nearest Neighbors (k-NN)

Code for K-Nearest Neighbors (k-NN)
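
The comparison described above can be illustrated with a short, self-contained sketch. This is not the project's code (that is linked above); it uses synthetic data in place of the Davao City weather records and scikit-learn defaults, but it shows the two approaches side by side: a single k-NN regressor trained on the full dataset versus separate k-NN regressors fitted within K-means clusters.

```python
# Minimal sketch: global k-NN vs cluster-based k-NN for predicting relative humidity.
# Synthetic data stands in for the Davao City weather dataset (not the project's actual code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
n = 1000
# Hypothetical predictors: visibility, temperature, pressure, wind speed
X = rng.normal(size=(n, 4))
# Hypothetical target: relative humidity as a noisy function of the predictors
y = 70 + 5 * X[:, 1] - 3 * X[:, 0] + 2 * X[:, 3] + rng.normal(scale=2, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Approach 1: one k-NN model on the full dataset
knn_full = KNeighborsRegressor(n_neighbors=5).fit(X_train_s, y_train)
rmse_full = mean_squared_error(y_test, knn_full.predict(X_test_s)) ** 0.5

# Approach 2: K-means clusters, then one k-NN model per cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_train_s)
cluster_models = {}
for c in range(kmeans.n_clusters):
    mask = kmeans.labels_ == c
    cluster_models[c] = KNeighborsRegressor(n_neighbors=5).fit(X_train_s[mask], y_train[mask])

# Assign each test point to its nearest cluster, then predict with that cluster's model
test_clusters = kmeans.predict(X_test_s)
y_pred_clustered = np.array([
    cluster_models[c].predict(x.reshape(1, -1))[0]
    for x, c in zip(X_test_s, test_clusters)
])
rmse_clustered = mean_squared_error(y_test, y_pred_clustered) ** 0.5

print(f"RMSE, k-NN on full data:  {rmse_full:.3f}")
print(f"RMSE, cluster-based k-NN: {rmse_clustered:.3f}")
```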


Title: Comparative Analysis of Avocado Ripeness Classification Using Random Forest and k-Nearest Neighbors (k-NN)

Objectives:

  1. To develop Random Forest and k-Nearest Neighbors (k-NN) models for classifying avocado ripeness.
  2. To assess which of the two algorithms yields higher accuracy in classifying the ripeness of avocados.

Conclusion: Of the two models evaluated, the Random Forest classifier provided the more accurate and consistent performance, making it the more suitable algorithm for classifying avocado ripeness in this study (a short illustrative sketch of the comparison appears below).

Code for Random Forest and k-NN: Avocado Ripeness Classification

Retrieved Dataset: Avocado Ripeness
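
As a rough illustration of the comparison (not the linked project code), the sketch below trains both classifiers on a synthetic stand-in for the avocado ripeness dataset and reports test accuracy; the feature set and class labels are made up for the example.

```python
# Minimal sketch: Random Forest vs k-NN accuracy on a synthetic 3-class problem
# standing in for the avocado ripeness data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Synthetic stand-in: 3 ripeness classes, a handful of numeric features
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_classes=3, n_clusters_per_class=1, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=7)

# k-NN is distance-based, so features are standardized before fitting
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
rf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_train, y_train)

print("k-NN accuracy:         ", accuracy_score(y_test, knn.predict(scaler.transform(X_test))))
print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```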


Title: Correlational Analysis on Years of Experience and Salary

Objective:

  1. To determine the strength and direction of the linear relationship between years of experience and salary using statistical correlation analysis.

Conclusion: The correlation coefficient between Years of Experience and Salary is approximately 0.98, indicating a very strong positive linear relationship. This means that as the number of years of experience increases, the salary also tends to increase significantly. The strength of this correlation suggests that experience is a major factor influencing salary growth.

Code for the Correlational Analysis

Retrieved Dataset: Experience and Salary
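
A minimal sketch of the analysis is shown below. It is not the linked code; the experience and salary values are made-up illustrative numbers, and the Pearson correlation is computed with scipy.

```python
# Minimal sketch: Pearson correlation between years of experience and salary,
# using made-up illustrative values rather than the retrieved dataset.
import numpy as np
from scipy import stats

years = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
salary = np.array([32, 35, 41, 44, 50, 55, 58, 63, 70, 74])  # e.g. thousands per year

r, p_value = stats.pearsonr(years, salary)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.4f}")
# An r close to +1 (roughly 0.98 in the project above) indicates a very strong
# positive linear relationship between experience and salary.
```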


R Programming


Title: Real Estate Price Prediction in the Philippines Using Ensemble and Regularization-Based Machine Learning Models

Objectives:

  1. To preprocess and transform non-normally distributed variables, including applying a logarithmic transformation to the target variable (Price), and handle missing values appropriately.
  2. To evaluate the predictive performance of four machine learning models—K-Nearest Neighbors (KNN), Ridge Regression, Lasso Regression, and Random Forest—for estimating property prices in the Philippines.
  3. To compare the models’ performance using metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R²).
  4. To perform cluster analysis on property features (e.g., Bedrooms, Bathrooms, Floor Area, Land Area, Latitude, and Longitude) to identify distinct groups of properties.

Conclusion: The KNN model showed solid performance in predicting property prices, with an RMSE of 3.91, R² of 0.75, and MAE of 3.76 on the test set, indicating a good but slightly imperfect fit. In comparison, Ridge Regression achieved an RMSE of 0.50, R² of 0.84, and MAE of 0.33, demonstrating a better fit with lower prediction errors. Lasso Regression had similar results to Ridge, with an RMSE of 0.50, R² of 0.84, and MAE of 0.38, suggesting that both regularized regression models handled the data effectively. On the other hand, the Random Forest model excelled with an RMSE of 0.40, R² of 0.90, and MAE of 0.25, outperforming the KNN, Ridge, and Lasso models, showcasing its ability to handle complex relationships in the data.

Code for Real Estate Price Prediction in the Philippines Using Ensemble and Regularization-Based Machine Learning Models
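
The original analysis is written in R (linked above). The Python sketch below illustrates the same workflow under simplifying assumptions: a log-transformed price target, synthetic features standing in for the Philippine listings, and the four models compared on RMSE, MAE, and R². The cluster-analysis step from objective 4 is omitted for brevity.

```python
# Minimal Python sketch of the workflow (the linked project itself is in R):
# log-transform the price target, fit KNN, Ridge, Lasso, and Random Forest regressors,
# and compare RMSE, MAE, and R^2 on synthetic stand-in data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

rng = np.random.default_rng(0)
n = 1500
# Hypothetical features: bedrooms, bathrooms, floor area, land area, latitude, longitude
X = np.column_stack([
    rng.integers(1, 6, n), rng.integers(1, 4, n),
    rng.uniform(30, 400, n), rng.uniform(50, 1000, n),
    rng.uniform(5, 19, n), rng.uniform(117, 127, n),
])
price = 50_000 * X[:, 2] + 20_000 * X[:, 3] + 500_000 * X[:, 0] + rng.lognormal(13, 0.5, n)
y = np.log(price)  # log-transform the skewed target, as in objective 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=7)),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name:>13}: RMSE={rmse:.3f}  MAE={mean_absolute_error(y_test, pred):.3f}  "
          f"R2={r2_score(y_test, pred):.3f}")
```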


Title: A Robust k-NN Model for Breast Cancer Survival Analysis: Tackling Class Imbalance with Upsampling and Downsampling

Objectives:

  1. To develop a predictive model for breast cancer survival status (alive or dead) using the k-Nearest Neighbors (k-NN) algorithm.
  2. To address class imbalance in the dataset using upsampling and downsampling techniques for fair model training.
  3. To evaluate the classification performance of the k-NN model using metrics such as accuracy, sensitivity, specificity, and AUC (Area Under the Curve).
  4. To analyze the impact of class balancing methods on the predictive performance of the k-NN model.

Conclusion: The k-NN model evaluated with cross-validation (downsampling) achieved an accuracy of 84.2%, while the k-NN model evaluated with bootstrapping (downsampling) achieved 84.5%. The difference is small, but because the bootstrapped model has the higher accuracy, we conclude that the k-NN model with bootstrapping (downsampling) is the better of the two (a short sketch of the balancing step appears below).

Code for A Robust k-NN Model for Breast Cancer Survival Analysis: Tackling Class Imbalance with Upsampling and Downsampling
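
The linked project is written in R. The Python sketch below illustrates only the class-balancing idea from objectives 1, 2, and 4 under simplifying assumptions: a synthetic imbalanced dataset stands in for the breast cancer survival data, the minority class is upsampled (or the majority class downsampled) on the training set only, and a k-NN classifier is scored with accuracy and AUC. The cross-validation versus bootstrap comparison from the conclusion is not reproduced here.

```python
# Minimal sketch: upsampling / downsampling with sklearn.utils.resample before fitting k-NN.
# Synthetic imbalanced data stands in for the breast cancer survival dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample
from sklearn.metrics import accuracy_score, roc_auc_score

# Imbalanced stand-in data: ~85% "alive" (0), ~15% "dead" (1)
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.85, 0.15], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=3)

def rebalance(X, y, mode):
    """Upsample the minority class or downsample the majority class (training set only)."""
    X_maj, y_maj = X[y == 0], y[y == 0]
    X_min, y_min = X[y == 1], y[y == 1]
    if mode == "up":
        X_min, y_min = resample(X_min, y_min, replace=True, n_samples=len(y_maj), random_state=3)
    else:  # "down"
        X_maj, y_maj = resample(X_maj, y_maj, replace=False, n_samples=len(y_min), random_state=3)
    return np.vstack([X_maj, X_min]), np.concatenate([y_maj, y_min])

for mode in ("up", "down"):
    Xb, yb = rebalance(X_train, y_train, mode)
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)).fit(Xb, yb)
    proba = knn.predict_proba(X_test)[:, 1]
    print(f"{mode}sampling: accuracy={accuracy_score(y_test, knn.predict(X_test)):.3f}  "
          f"AUC={roc_auc_score(y_test, proba):.3f}")
```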