Applied Data Science (MS) Student Capstone Projects
Master Real-World Application with the Case Analysis Capstone
The Case Analysis Capstone (ADS670) in the MS in Applied Data Science program is designed to bridge the gap between theory and practice. This hands-on, project-based course enhances both technical and soft skills, critical competencies often overlooked in traditional coursework but essential for success in modern data science careers.
In this culminating experience, students apply what they've learned throughout the program to tackle real-world data challenges, drive innovation, and effectively communicate their findings to diverse audiences. It’s a dynamic opportunity to refine your analytical abilities while also strengthening leadership, collaboration, and presentation skills.
Explore a selection of original research studies completed by our students, each showcasing the depth, creativity, and applied knowledge developed through the Case Analysis Capstone.
Viral or Nah?
by Henry R. Alston G'25
What makes a song go viral? Is it a catchy hook, a beat that makes you want to move, or is it something beyond the surface that is encoded in the sound? In today’s modern era of viral content and discovery fueled by algorithms, music virality has become a cultural phenomenon. It has also become a complex data problem. The purpose of this capstone project is to explore the predictive potential of song attributes in determining whether a song will go viral on social media, using a classification-based machine learning approach constructed from real music streaming and social media data.
Three datasets were curated: TikTok trending song tracks that represent virality, two time-sliced Spotify audio feature datasets (November 2018 and April 2019), and a collection of unpopular Spotify songs enriched with genre metadata, serving as the control. After resolving issues such as naming inconsistencies, merging columns, and labeling viral (1) vs. non-viral (0), a combined dataset was constructed, consisting of 6,780 tracks. Using XGBoost, a classification model was developed; XGBoost was selected for its ability to handle complex tabular data. A train-test split was applied to the dataset, and audio features such as danceability, tempo, energy, valence (a measure of the song's positive mood), and loudness (in decibels) were extracted. The combined dataset showed a natural class imbalance, skewed toward the viral class, so balanced sample weights were applied during training. The final, high-performing model revealed a strong relationship between specific audio features and a song's viral potential, achieving 99.78% accuracy and a 0.93 ROC AUC. Audio feature importance analysis uncovered that danceability, energy, and valence were the most influential predictors of virality. This suggests that emotionally uplifting, rhythmic, high-energy music is more likely to go viral on social media platforms like TikTok, Instagram, and YouTube Shorts – ecosystems known to function as viral engines for music today.
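To make the modeling step concrete, here is a minimal sketch of an XGBoost classifier trained on a handful of audio features with balanced sample weights, in the spirit of the approach described above; the file name, column names, and hyperparameters are illustrative assumptions, not the author's actual code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Assumed merged TikTok/Spotify table with a 0/1 "viral" label.
df = pd.read_csv("combined_tracks.csv")
features = ["danceability", "tempo", "energy", "valence", "loudness"]
X, y = df[features], df["viral"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Offset the class imbalance noted in the abstract with per-sample weights.
weights = compute_sample_weight(class_weight="balanced", y=y_train)

model = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
model.fit(X_train, y_train, sample_weight=weights)

proba = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("ROC AUC :", roc_auc_score(y_test, proba))
```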
These findings support the hypothesis that a song's virality can be, at the very least, partially decoded from audio features alone, independent of marketing or the momentum created through sharing and word-of-mouth. This study bridges data, music, and digital culture, offering a new outlook on what makes music go viral through measurable audio features.
Beyond Demographics: Modeling Academic Risk in Higher Education
by Joanna Anderson G'25
Education has long been a pathway to greater equity. Still, recent cultural shifts have prioritized performance metrics over equity-driven measures, diminishing the role of demographics in identifying at-risk students. Leveraging synthetic data from a Kaggle competition, I evaluate overall predictive performance and examine feature importances to assess both the value of demographics and the potential for non-demographic proxies. This project utilizes XGBoost to compare two approaches: one model that incorporates demographic features and another that omits them entirely. The results show that the nondemographic model outperforms its demographic-inclusive counterpart within this dataset; however, these findings are inherently tied to the specific population represented and therefore cannot be generalized across institutions. Ultimately, while demographic variables may appear less critical, their collection and analysis remain vital for detecting disparate impacts—an assessment that is impossible without demographic data.
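As an illustration of the comparison described above, the sketch below trains the same XGBoost setup twice, once with and once without demographic columns, and compares test AUC; the file name, column names, and target label are assumptions rather than the actual Kaggle schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("students.csv")            # assumed synthetic Kaggle data
demographic = ["race", "gender", "age"]     # assumed demographic columns
target = "at_risk"                          # assumed binary outcome label

X = pd.get_dummies(df.drop(columns=[target]))
y = df[target]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def fit_and_score(cols):
    # Same model configuration, different feature set.
    model = XGBClassifier(n_estimators=200, eval_metric="logloss")
    model.fit(X_tr[cols], y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te[cols])[:, 1])

with_demo = list(X.columns)
without_demo = [c for c in X.columns
                if not any(c.startswith(d) for d in demographic)]
print("AUC with demographics   :", fit_and_score(with_demo))
print("AUC without demographics:", fit_and_score(without_demo))
```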
Discovering the Right Book: A Sentiment-Aware Recommendation System Program Data Science and AI Applications in E-Commerce
by Mangalambigai Annamalai G'25
Finding the perfect next book in a sea of millions is overwhelming. Readers are often left scrolling endlessly, unsure of what to pick. Our project addresses this challenge by developing an intelligent book recommendation system that utilizes sentiment analysis on real user reviews to provide personalized suggestions. This is not just a tool for convenience—it has the power to transform how readers discover new content and how authors connect with audiences. We leveraged a Kaggle dataset comprising over 3 million Amazon book reviews and more than 200,000 book metadata entries. Using Python's data processing libraries, we cleaned, structured, and labeled reviews with sentiment tags (positive, neutral, negative). We trained an LSTM (Long Short-Term Memory) deep learning model to classify review sentiments based on text input. This model achieved robust performance and laid the foundation for building a recommendation engine that incorporates both sentiment and book content features. Our initial results show that sentiment-aware filtering can significantly enhance the relevance of recommendations. Readers are more likely to enjoy books with a positively reviewed emotional tone, and this model helps surface those gems. As a next step, we plan to refine and deploy the system via a user-facing web application for interactive use. This work underscores the value of deep learning in real-world applications and highlights how thoughtful NLP techniques can bridge the gap between user emotion and algorithmic recommendation. It's more than just data—it's about delivering joy through the right story.
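A minimal sketch of an LSTM sentiment classifier along these lines is shown below, assuming reviews are already cleaned and labeled as negative/neutral/positive; the vocabulary size, sequence length, and architecture are illustrative, not the project's exact configuration.

```python
import numpy as np
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Placeholder reviews and labels (0 = negative, 1 = neutral, 2 = positive).
texts = ["loved this book", "not worth reading", "it was okay"]
labels = np.array([2, 0, 1])

tokenizer = Tokenizer(num_words=20000, oov_token="<unk>")
tokenizer.fit_on_texts(texts)
X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=200)

model = Sequential([
    Embedding(input_dim=20000, output_dim=128),  # learn word embeddings
    LSTM(64),                                    # sequence encoder
    Dense(3, activation="softmax"),              # three sentiment classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, labels, epochs=3, batch_size=32)
```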
Forecasting Stock Price Movements: A 10-Ticker Study
by Omar Becerra G'25
In this talk, I present a stock price prediction system built on 10 years of historical data for 10 prominent companies, including AAPL, TSLA, META, and AMZN. My project examines the feasibility and accuracy of forecasting short-term stock movement (7-day direction) using technical indicators and machine learning models, including Logistic Regression, Random Forest, XGBoost, and ARIMA for time series forecasting. I walk through the entire process, from data preprocessing and feature engineering using technical analysis indicators, to model training, evaluation, and visual analytics. I compare performance across models using precision, recall, and F1-score, highlighting how each algorithm handles volatility, momentum, and trend detection. In addition to demonstrating charts, heatmaps, and classifier outputs, I discuss the real-world implications of prediction accuracy. This serves as a baseline for building market-tracking tools, and offers a clear path forward for incorporating news sentiment and fundamental indicators in future development.
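The sketch below illustrates the kind of feature engineering and labeling described above: computing a few simple technical indicators from daily closing prices and labeling the 7-day direction; the CSV layout, indicator windows, and the choice of a Random Forest are assumptions, not the project's actual pipeline.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Daily closing prices; the CSV layout (Date index, Close column) is assumed.
prices = pd.read_csv("AAPL_daily.csv", index_col="Date", parse_dates=True)["Close"]

feat = pd.DataFrame(index=prices.index)
feat["ret_1d"] = prices.pct_change()                    # 1-day return
feat["sma_ratio"] = prices / prices.rolling(20).mean()  # trend vs. 20-day average
feat["volatility"] = feat["ret_1d"].rolling(20).std()   # 20-day volatility
feat["momentum"] = prices.pct_change(10)                # 10-day momentum

# Label: will the close be higher 7 trading days from now?
future = prices.shift(-7)
feat["up_in_7d"] = (future > prices).astype(int)
data = feat[future.notna()].dropna()

X, y = data.drop(columns="up_in_7d"), data["up_in_7d"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False, test_size=0.2)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```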
Court Vision: The Science Behind NBA Prediction Accuracy
by Christopher Bratkovics
The NBA operates on gut feelings and conventional wisdom, but what if we could transform speculation into science? This research achieves 93.9% accuracy in predicting NBA player performance, reshaping our understanding of basketball. I analyzed 169,851 game records spanning four seasons (2021-2025), developing a dual-approach system: machine learning models that predict with high accuracy, paired with statistical tests that explain why players perform as they do. My models predict scoring within 1.22 points, rebounds within 1.06, and assists within 0.75—accuracy levels that transform guesswork into reliable forecasting. But the real breakthrough came from discovering that feature interactions matter more than individual statistics. Well-rested minutes produce fundamentally different performance than fatigued minutes—an insight that reshapes basketball strategy. I also confirmed statistically what everyone suspected: rest improves performance, home courts provide real advantages, and three-point shooting has revolutionized the game (all p < 0.01). This creates immediate value: NBA teams can win 2-3 more games through optimized rotations, fantasy players gain edges worth thousands in prize pools, and media can replace speculation with data-driven narratives. By transforming chaotic data into production-ready insights, this work shows that basketball performance isn't random - it's predictable when you have the right analytical vision.
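One way to probe the feature-interaction claim above is to add an explicit minutes-by-rest interaction term and compare models with and without it, as in the hypothetical sketch below; the game-log file and column names are placeholders, not the author's dataset.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

games = pd.read_csv("player_games.csv")                       # assumed game-log table
games["min_x_rest"] = games["minutes"] * games["rest_days"]   # explicit interaction term

base = ["minutes", "rest_days", "home", "fga"]                # assumed base features
X_tr, X_te, y_tr, y_te = train_test_split(
    games[base + ["min_x_rest"]], games["points"], random_state=0
)

no_interaction = GradientBoostingRegressor().fit(X_tr[base], y_tr)
with_interaction = GradientBoostingRegressor().fit(X_tr, y_tr)

print("MAE without interaction:",
      mean_absolute_error(y_te, no_interaction.predict(X_te[base])))
print("MAE with interaction   :",
      mean_absolute_error(y_te, with_interaction.predict(X_te)))
```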
Opioid Use Disorder: Understanding this disease and its consequences
by Norman Ennis G'25
Opioid Use Disorder (OUD) represents a complex, multifactorial condition defined by the maladaptive use of opioids, resulting in physical, psychological, and social impairment. This disease is a chronic, relapsing condition characterized by the compulsive use of opioids despite harmful consequences. One major problem is recognizing problematic opioid use in the clinical setting. The purpose of this research is to utilize machine learning with past health data to accurately predict certain factors that might contribute to Opioid Use Disorder. I first obtained the dataset from the Centers for Disease Control and Prevention (CDC) on overdoses from 2020 to 2023, which contained 145 rows and 270 columns. The columns represented different variables such as race, gender, and age. I proceeded with my analysis using a Random Forest Regressor, which combines the outputs of multiple decision trees to provide a result for my target, OUD. Some variables were removed because they did not contribute to the analysis, such as drugs other than opioids. Lastly, I used the feature importance method to determine which variables had the most influence on predictions. The results yielded a Mean Absolute Error of 0.127 and a Mean Squared Error of 0.0324, indicating that the model is a good fit for the data and the predictions are close to the actual values. The model highlighted important features, such as Male, Intervention, and ages 35 to 44, as the most influential. Using machine learning on retrospective data can provide accurate predictions of whether an individual might be at risk of developing Opioid Use Disorder.
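A minimal sketch of this workflow, a Random Forest regressor followed by feature-importance ranking, might look like the following; the file name and target column are placeholders rather than the actual CDC schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("cdc_overdoses_2020_2023.csv")      # placeholder file name
X = pd.get_dummies(df.drop(columns=["oud_rate"]))    # "oud_rate" is an assumed target
y = df["oud_rate"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
rf = RandomForestRegressor(n_estimators=500, random_state=1).fit(X_tr, y_tr)

pred = rf.predict(X_te)
print("MAE:", mean_absolute_error(y_te, pred))
print("MSE:", mean_squared_error(y_te, pred))

# Rank the variables that drive the predictions, as in the abstract.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```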
Predicting Treatment Service Type in Crisis Admissions
by Jesus Komiyama G'25
This study aims to predict treatment service types based on demographic and clinical indicators using the Treatment Episode Data Set (TEDS), released annually by the Substance Abuse and Mental Health Services Administration (SAMHSA). Using a Random Forest model implemented via the Ranger package in R, the classification task spans eight service categories and 65 clinical, demographic, and socioeconomic predictors. Multiple modeling strategies were employed, including hyperparameter tuning, use of alternative split rules, and evaluation of class weights. Models were evaluated using Out-of-Bag (OOB) error rates, confusion matrices, and per-class statistics, including sensitivity, specificity, and balanced accuracy. Initial models achieved modest accuracy with Kappa values indicating fair agreement. Variable importance plots consistently identified NOPRIOR, PRIMINC, AGE, PSOURCE, METHUSE, and RACE as the most influential predictors. To improve performance, the downsampling technique was implemented to address class imbalance. This technique improved the sensitivity for underrepresented classes and increased the accuracy for most categories. Although the best-performing model achieved 50.2% accuracy, the difficulty of improving the accuracy in subsequent models highlights the challenge of modeling imbalanced multi-class outcomes in behavioral health data. This challenge underscores the importance of selecting relevant predictors and class-balancing strategies.
Understanding Customer Sentiment in FIGS Scrub Reviews Using BERT-Based NLP
by Alena LeGros G'25
What if a 5-star rating doesn't tell the whole story? In the booming healthcare apparel market, understanding how customers feel about products like FIGS scrubs is critical, but star ratings alone often miss nuance. This project explores customer sentiment by analyzing 10,000 real product reviews from the FIGS website. Reviews were collected using a custom-built scraper that accessed the site's GraphQL API. During the scraping process, a technical challenge was encountered: the FIGS website does not load review pages beyond page 1000, which limited the final dataset to 10,000 reviews. The analysis began with 5,000 reviews for the first work-in-progress presentation. Initial techniques included word frequency counts and visualizations, such as heatmaps, to explore how language varied across each star rating. While TF-IDF and VADER were initially tested, they were not used in the final project due to limitations related to context and nuanced language. For the final analysis, the BERT language model was used to predict review sentiment on a 1-to-5-star scale based solely on the review text. The results were compared to the original customer ratings using confusion matrices, precision, and recall. This comparison revealed that BERT frequently underestimated reviews, yet it still aligned well with the actual sentiment. The project highlights how advanced natural language models can provide a deeper understanding of customer feedback and uncover patterns that numerical ratings alone may miss.
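One off-the-shelf way to reproduce the described step is a BERT model fine-tuned to output 1-5 star labels, as in the sketch below; the specific model checkpoint is an assumption and not necessarily the one used in this project.

```python
from transformers import pipeline

# Publicly available BERT checkpoint that predicts a 1-5 star rating from text;
# chosen here for illustration only.
star_model = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

reviews = [
    "These scrubs fit perfectly and wash well.",
    "Color faded after two washes, disappointed.",
]
for review, out in zip(reviews, star_model(reviews)):
    # Labels look like "4 stars"; compare against the customer's own rating.
    print(out["label"], "|", review)
```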
Predicting Rare Manufacturing Failures using LightGBM
by Dawn Schmidt G'25
This project addresses the challenges of predicting rare manufacturing failures in the Bosch Production Line Performance dataset, a Kaggle dataset that presents an extreme class imbalance. The objective was to maximize recall to ensure defective parts are identified early in the production process. The final modeling pipeline combined median imputation, SMOTE to balance classes, SHAP-based feature selection (selecting the top 400 features), and dimensionality reduction using Principal Component Analysis (PCA) to 50 components. A LightGBM classifier was trained with early stopping and tuned using threshold optimization to favor high recall. The final model achieved a recall of 0.92 on the minority class with an AUC of 0.653 and a Matthews Correlation Coefficient of 0.106. While precision remained low (0.26), this trade-off is acceptable in manufacturing environments where the cost of missing a defect far outweighs the cost of over-flagging. This pipeline demonstrates a robust and interpretable approach for detecting rare events in industrial machine learning applications.
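A condensed sketch of such a pipeline is shown below, with median imputation, SHAP-based selection of the top 400 features, SMOTE, PCA to 50 components, a LightGBM classifier, and a lowered decision threshold; the hyperparameters and helper structure are illustrative rather than the project's exact code.

```python
import numpy as np
import shap
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

def fit_rare_failure_model(X_train, y_train):
    # 1. Median imputation for the heavily sparse sensor columns.
    imputer = SimpleImputer(strategy="median")
    X_imp = imputer.fit_transform(X_train)

    # 2. Rank features by mean |SHAP| from a probe model and keep the top 400.
    probe = LGBMClassifier(n_estimators=200).fit(X_imp, y_train)
    shap_vals = shap.TreeExplainer(probe).shap_values(X_imp)
    shap_vals = shap_vals[1] if isinstance(shap_vals, list) else shap_vals
    top = np.argsort(np.abs(shap_vals).mean(axis=0))[-400:]

    # 3. Balance the classes with SMOTE, then compress to 50 PCA components.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_imp[:, top], y_train)
    pca = PCA(n_components=50).fit(X_bal)

    # 4. Final LightGBM classifier on the reduced representation.
    clf = LGBMClassifier(n_estimators=500).fit(pca.transform(X_bal), y_bal)
    return imputer, top, pca, clf

def flag_defects(model_parts, X_new, threshold=0.2):
    # A low threshold deliberately trades precision for recall on defects.
    imputer, top, pca, clf = model_parts
    proba = clf.predict_proba(pca.transform(imputer.transform(X_new)[:, top]))[:, 1]
    return (proba >= threshold).astype(int)
```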
Predicting Hospital Readmissions Among Diabetic Patients
by Pedro Vinhais G'25
Hospital readmissions—especially within 30 days—are a top concern for hospitals, insurers, and patients. Diabetic patients are particularly vulnerable, with each readmission averaging $15,000 and causing added trauma. This project aims to predict readmissions, especially <30-day ones, using structured Electronic Health Record (EHR) data, with a focus on maximizing recall to flag at-risk individuals. Using a public readmissions dataset, I engineered and selected features, addressing class imbalance with class-weighted logistic regression and cross-validation-based probability calibration. The first final model was a calibrated logistic regression, built with 5-fold cross-validation and a lowered threshold to prioritize <30-day readmissions while preserving interpretability. The second model was a combination of Logistic Regression and Random Forest, using separate thresholds for each class to improve recall for general readmissions. SHAP values were used to explain predictions and uncover patterns in misclassified cases. The combo model achieved 68% recall for general readmissions, outperforming other approaches. The calibrated logistic regression achieved a 51% recall and a 31% F1 score for <30-day readmissions, comparable to more complex ensemble models. The top predictive features included inpatient visit frequency, insulin usage, diagnosis complexity, and discharge disposition. Cluster analysis of missed readmissions revealed common traits: steady or no insulin, multiple diagnoses, and frequent prior admissions. This project shows that combining models with threshold tuning can improve class-specific recall. Additionally, a calibrated, class-weighted logistic regression—paired with SHAP explanations—offers a simple, interpretable, and effective solution for identifying early readmission risk, making it suitable for real-world clinical decision support.
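The first model described above could be sketched roughly as follows: a class-weighted logistic regression calibrated with 5-fold cross-validation and scored at a lowered threshold; the feature matrix, labels, and the specific threshold value are assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split

def fit_calibrated_lr(X, y, threshold=0.3):
    """X: engineered EHR features, y: 1 for <30-day readmission (assumed inputs)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

    # Class weights counter the imbalance; sigmoid calibration over 5 CV folds.
    base = LogisticRegression(class_weight="balanced", max_iter=1000)
    model = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X_tr, y_tr)

    # Lowering the threshold trades precision for recall on <30-day readmits.
    proba = model.predict_proba(X_te)[:, 1]
    pred = (proba >= threshold).astype(int)
    print("recall:", recall_score(y_te, pred), "F1:", f1_score(y_te, pred))
    return model
```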
Baseball Analytics: Projecting Playoff Teams for the 2023 MLB Season
by Allen Wimberly G'25
With billions of dollars in sports betting and team bragging rights on the line, using analytics and various models to inform decisions for Major League Baseball (MLB) teams is more critical than ever. Sabermetrics is emerging as a helpful tool that analysts use to make predictions and develop new models to rank teams. This discussion outlines various models and their performance during the 2023 MLB season to accurately predict which teams will reach the playoffs. Run differential, Pythagorean, Bradley-Terry, and a multilinear regression using basic baseball statistics were compared. A sample of the first half of the 2023 season was used to predict the final records for every team. For each model, the predicted division winners, along with the wild-card teams, were compared to the known playoff teams. The Run Differential and Pythagorean models correctly projected 7 out of 12 teams that reached the playoffs, while the Bradley-Terry and multilinear regression models each correctly projected 10 out of 12.
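For reference, the Pythagorean model mentioned above reduces to a one-line formula; the sketch below projects a final win total from first-half run totals using the classic exponent of 2 (MLB analysts often use roughly 1.83), and the example numbers are made up.

```python
def pythagorean_wins(runs_scored: int, runs_allowed: int,
                     wins_so_far: int, games_remaining: int,
                     exponent: float = 2.0) -> float:
    """Project a final win total from partial-season run totals (Pythagorean model)."""
    expected_pct = runs_scored ** exponent / (
        runs_scored ** exponent + runs_allowed ** exponent
    )
    return wins_so_far + expected_pct * games_remaining

# Hypothetical team: 45-36 at the break, 405 runs scored, 370 allowed, 81 games left.
print(round(pythagorean_wins(405, 370, 45, 81)))  # projected wins, roughly 89
```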
Traffic Accident Analysis of Maryland Municipalities: A Four-Year Study
by Nellie Boling G'24
This study analyzed traffic accident data from 20 Maryland municipalities covering January 2015 through January 2024. The study aimed to illuminate common causes of traffic accidents and possible correlations between specific circumstances that might have increased the probability of an accident. These associations could provide insight to inform traffic laws and road infrastructure policy. There is already extensive research on traffic data; my goal with this project was to add to this knowledge, potentially leading to policy that could improve road safety and indirectly reduce the negative environmental impacts of the extensive traffic caused by accidents. The main method used was building association rules to identify a finite set of conditions that are highly likely to co-occur and, therefore, might be avoidable with some changes. Visualizations were also employed to contrast traffic-related variables. There is a strong trend of most accidents occurring in the Rockville municipality, with Gaithersburg the only municipality coming close to a similar number of traffic accidents. This, of course, correlated with higher rates of injury in accidents in these areas. I plan to use my findings as a jumping-off point to further my research on accidents and their effects on surrounding traffic and the environment.
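A hypothetical sketch of the association-rule step, using the mlxtend library on one-hot-encoded accident attributes, is shown below; the column names, support, and confidence thresholds are assumptions, not the study's actual settings, and the exact association_rules signature can vary slightly across mlxtend versions.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

crashes = pd.read_csv("maryland_crashes.csv")   # placeholder file name
cols = ["municipality", "weather", "light_condition", "injury_severity"]
onehot = pd.get_dummies(crashes[cols]).astype(bool)

# Mine frequent itemsets, then derive rules above a confidence floor.
frequent = apriori(onehot, min_support=0.05, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)

# Highest-lift rules point to combinations of conditions that co-occur far
# more often than chance, e.g. a municipality paired with an injury level.
print(rules.sort_values("lift", ascending=False)
           [["antecedents", "consequents", "support", "confidence", "lift"]]
           .head())
```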
Personalized Book Recommendation System
by Seth Born G'24
Over the years, the growing number of books available online has made it challenging for readers to discover new and unique books they will love. This project compares the effectiveness of three recommendation systems in predicting user book ratings. The first is a collaborative filtering model, which utilizes Singular Value Decomposition (SVD) to analyze user-item interactions, identifying patterns and preferences. The second is a content-based filtering model; it employs Term Frequency-Inverse Document Frequency (TF-IDF) and Truncated SVD to convert item attributes into numerical vectors, predicting user preferences based on the content of the books. The third is a meta-level neural network; it integrates predictions from both models along with user age and country, and may perform better by leveraging the strengths of the other two models. Using an independent data set, the models were evaluated with Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). The collaborative filtering model achieved the lowest RMSE and MAE, while the content-based filtering model had the highest. The meta-model's performance was slightly higher but comparable to the collaborative model; additional hyperparameter tuning and different architectures may improve its performance.
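As a small illustration, the collaborative-filtering baseline could be reproduced with the Surprise library's SVD as sketched below; the file name, column names, and rating scale are assumptions.

```python
import pandas as pd
from surprise import SVD, Dataset, Reader
from surprise.model_selection import cross_validate

# Assumed ratings table with user_id, isbn, and rating columns on a 1-10 scale.
ratings = pd.read_csv("book_ratings.csv")
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(ratings[["user_id", "isbn", "rating"]], reader)

# RMSE/MAE via 5-fold cross-validation, matching the evaluation metrics above.
cross_validate(SVD(n_factors=50), data, measures=["RMSE", "MAE"], cv=5, verbose=True)
```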
Optimizing Workforce Management: Overtime Analysis at Riverton University
by Christian Carde-Guzman
This project investigates overtime trends and associated costs at Riverton University, utilizing data from July 2021 to May 2024. The primary objectives are to identify patterns in overtime hours, determine the financial impact of these hours, and provide actionable insights for budget forecasting and workforce management. The findings reveal that most overtime hours are concentrated in specific units, with "Dining Services - Culinary" and "Custodial Services" consistently reporting the highest totals. Seasonal peaks in overtime hours are observed, particularly in October, November, February, and January, indicating potential periods of heightened activity. Furthermore, a significant portion of overtime costs is attributed to regular employees, highlighting the dependency on this workforce segment to meet organizational demands. By calculating total overtime costs based on an average pay rate of $28.75 per hour, the analysis quantifies the financial burden on each unit, with "Dining Services - Culinary" incurring the highest cost at approximately $478,748.56. These insights underscore the need for targeted strategies to manage and potentially reduce overtime expenditures, optimizing resource allocation and enhancing operational efficiency. Overall, this examination provides a foundational understanding of overtime dynamics at Riverton University, offering crucial data to inform strategic planning and improve fiscal oversight.
Predicting Nvidia Stock Market Behavior Using Machine Learning
by Gustavo Feliz G'24
In today's dynamic financial markets, accurate stock price prediction is essential for strategic investment decisions. In this presentation, I explore an innovative project that employs advanced machine learning techniques to forecast Nvidia's stock prices. My approach utilizes a combination of predictive models, including XGBoost, Random Forest, Support Vector Machines, and Linear Regression, integrating technical indicators and sentiment analysis to achieve high predictive accuracy. A key component of this project is integrating sentiment analysis from Twitter data, providing insights into public opinion and its impact on Nvidia's stock performance. By analyzing extensive historical data and market sentiment, my model offers a comprehensive view that aids traders and investors in making informed decisions. The project emphasizes the creation of an ensemble model, which combines the strengths of individual algorithms to enhance overall performance. This ensemble approach aims to surpass a 70% accuracy rate in predicting stock movements, offering a robust and reliable tool for financial forecasting. Join me to discover the potential of machine learning in transforming stock market predictions and to understand the critical role of sentiment analysis and ensemble modeling in this process.
Predictive Modeling of Dengue Fever Cases Using Historical and Climate Data
by Megan Gregory G'24
This study uses historical dengue surveillance data and environmental and climate data to predict dengue fever cases. The main objective is to leverage these datasets to predict the occurrence of the next dengue epidemic. Accurate dengue predictions are crucial for public health efforts to mitigate the impact of epidemics, and understanding the relationship between dengue cases and climate change can help improve resource allocations. The methodological approach involved selecting highly correlated variables to the target variable (total cases), normalizing non-normally distributed variables, and handling missing values through forward fill. The dataset was split into training and testing sets, with various models—Negative Binomial, VAR, SARIMA, and LSTM—applied to the training subset. The LSTM model demonstrated the lowest mean absolute error (MAE) and was thus used to forecast dengue cases on the test set, which did not have the total cases variable. The findings suggest that neural network models, such as LSTM, can make progressive predictions while considering prior predictions, which is essential. This project provides a promising direction for future dengue outbreak forecasting, ultimately aiding public health initiatives in proactive epidemic management.
Comparative Analysis of Deep Learning Models for Blade Damage Classification in Gas Turbines: Achieving 96% Accuracy with InceptionResNetV2
by Orlando Lopez Hernandez G'24
A borescope inspection is a non-destructive visual examination technique used to inspect the interior surfaces and components of a gas turbine (GT) that are otherwise inaccessible or difficult to reach. The power generation industry has traditionally relied on expert judgment during borescope inspections to assess the condition of internal components. Historically, attempts to employ image processing to estimate the remaining life of these components were limited by the technology's capabilities. However, advancements in deep learning now enable the extraction of intricate features from images, facilitating the creation of robust classification models. This project utilizes a dataset from a public library containing various aero engine damages, which were manually classified to form the training and validation datasets. Various deep-learning models were investigated, including ResNet50, InceptionV3, Xception, and InceptionResNetV2. Among these, the InceptionResNetV2 model from the Keras library achieved the highest accuracy of 96% after fine-tuning with Keras Tuner. These results confirm that a well-tuned InceptionResNetV2 model can effectively classify blade damage with high accuracy, suggesting potential for further improvement through additional layer tuning.
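A rough sketch of this transfer-learning setup in Keras is shown below; the directory layout, image size, classification head, and training schedule are assumptions rather than the tuned configuration reported above.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras.utils import image_dataset_from_directory

# Assumed directory layout: one sub-folder per damage class.
train_ds = image_dataset_from_directory("blade_images/train", image_size=(299, 299))
val_ds = image_dataset_from_directory("blade_images/val", image_size=(299, 299))
num_classes = len(train_ds.class_names)

base = InceptionResNetV2(include_top=False, weights="imagenet",
                         input_shape=(299, 299, 3), pooling="avg")
base.trainable = False  # freeze the pretrained backbone for the first pass

model = models.Sequential([
    layers.Rescaling(1.0 / 127.5, offset=-1),  # scale pixels to [-1, 1]
    base,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)
```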
Exploring how socioeconomic factors influence the cost of childcare and the rate of women's participation in the workforce.
by Precious Uwaya G'24
Childcare costs in the United States are a critical concern for families, with 67% of surveyed parents reporting spending 20% of their annual household income per child on childcare. This significant expense affects economic stability and workforce participation, particularly among women. This capstone project examines socioeconomic factors influencing childcare costs and their correlation with women's labor participation rates. Utilizing a comprehensive 2018 dataset from the Women's Bureau of the U.S. Department of Labor, this study analyzes the effect of critical socioeconomic factors on childcare costs and women's workforce participation rate. The research uncovers a stark reality: childcare costs and women's labor participation rates are not uniform across the United States. For instance, the West and Northeast regions bear a heavier financial burden for childcare. In the Northeast, factors such as women's median earnings and the percentage of families in poverty significantly influence labor participation. In the South, median family income and infant center-based childcare costs shape labor participation rates. In the Midwest, it is families in poverty and women's median earnings, while in the West, median family income and school-age center-based childcare costs impact women's participation in the workforce. Model validation across regions indicated that socioeconomic factors are complex and multifaceted, with childcare costs only one part of the broader picture. This study highlights the intricate relationship between childcare costs and women's labor force participation, emphasizing the need for policy interventions that make childcare more affordable through region-specific solutions.
Jiu-Jitsu Random Forest Regression Classifier
by Ryan Dearing G'24
This study aims to use Random Forest classification to predict the outcomes of Brazilian Jiu-Jitsu (BJJ) matches. Brazilian Jiu-Jitsu is a martial art similar to wrestling, but instead of pinning an opponent, the goal is to submit them with "submissions". The data used for this study was curated from the www.bjjheroes.com website and consists of 1,470 match results from 16 athletes entered in the 77-kilogram Abu Dhabi Combat Club (ADCC) tournament. The data includes athlete name, opponent name, match result, method of result, competition name, weight, year, and home gym name. Random Forest classification was used as the analysis method to find predictive relationships between the features. Feature engineering was conducted to create cumulative averages by year for the athletes, compared to the 2024 calendar year results. A cross-validated grid search was conducted to further improve the final model. The most meaningful final features included opponent win rate, win method by points %, and win method by referee decision %, among others. The model saw a dramatic reduction in athlete names being predictive once cumulative statistics were implemented. Future iterations of this project leave room to improve the dataset and the model: Brazilian Jiu-Jitsu data is not centrally located or tracked across the many competitions, and more descriptive data on intra-match performance does not exist.
Development and evaluation of curve fitting models to reduce calibration time for GEM blood-gas analyzer**
By: Celine Breton G'24
The GEM Premier 5000 is a point-of-care blood-gas analyzer that reports concentrations of analytes in patient blood samples using sensors. It provides accurate results through frequent calibration using process control solutions (PCS), which are exposed to the sensors for 55 seconds. During this PCS exposure, mV readings from the sensors are recorded, and the mV reading from the end of the soak profile (at t=55 seconds) is used for calibration. The purpose of this study was to develop and evaluate curve fitting models to reduce the amount of time PCS solutions need to be exposed to the blood sensors, which would increase instrument availability and allow users to process higher sample volumes. This study focused on developing methods for PCS A, the main calibration solution, and the pCO2 sensor. 20,000 PCS A soak profiles were used to develop curve fitting models to predict the end-of-soak-profile mV using varying time segments from 1-37 seconds. Three model types were evaluated: linear, parabolic, and constrained parabolic. These models were evaluated via two main metrics: the RMSE of the predicted end soak profile mV versus the actual value, and the amount of error that using the predicted soak profile mV would introduce into a hypothetical patient blood sample. The constrained parabolic model with a time frame of 30-37 seconds and a constrained vertex of 80 performed best, with an RMSE of 0.66 and 99% of the hypothetical samples calculated using the predicted mV falling within total allowable error.
** Video Not available due to the proprietary nature of the topic.
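For illustration, the winning approach could be sketched as a parabola with its vertex fixed at t = 80, fitted to the 30-37 second window and extrapolated to t = 55 seconds; the synthetic readings and parameter values below are placeholders, not GEM instrument data.

```python
import numpy as np
from scipy.optimize import curve_fit

VERTEX_T = 80.0   # constrained vertex location (per the abstract); illustrative

def constrained_parabola(t, a, c):
    # Vertex form with the vertex time fixed, leaving two free parameters.
    return a * (t - VERTEX_T) ** 2 + c

def predict_end_mv(times, mv, t_end=55.0):
    """Fit on a partial soak window and extrapolate the end-of-soak mV."""
    params, _ = curve_fit(constrained_parabola, times, mv)
    return constrained_parabola(t_end, *params)

# Example with synthetic readings sampled once per second from 30-37 s.
t = np.arange(30, 38, dtype=float)
mv = 0.01 * (t - VERTEX_T) ** 2 + 12.0 + np.random.normal(0, 0.05, t.size)
print("predicted mV at 55 s:", round(predict_end_mv(t, mv), 2))
```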
Healthcare field advancement: The implementation of machine learning in diagnosing age-related health conditions
By: Kyle Lacson G'24
Medical diagnosis is an integral process for medical professionals within the healthcare community, as it is essential for proper diagnosis and treatment. By leveraging data science principles, researchers and healthcare professionals can utilize machine learning and deep learning methods to draw accurate medical conclusions. Patient health and medical information were collected and anonymized by InVitro Cell Research for data enthusiasts to explore different classification methods and algorithms. The data was pre-processed and visualized to better understand what it comprised. A handful of algorithms, consisting of Gaussian classifiers, tree classifiers, and deep learning methods, were trained and evaluated using multiple variations of the dataset. The XGBoost classifier proved to be the best algorithm for the job, correctly classifying 95.96% of individuals as having a medical condition or not. It was concluded that machine learning and deep learning principles will help develop and advance the healthcare field.
Curious Confounders: A Gestational Age Gene Expression Meta Analysis
By: Esther Malka Laub G'24
Congenital heart disease (CHD) is heart disease that is present at birth. Although congenital heart defects are the most common birth defect in the United States, very little is known about their causes. Using three publicly available NIH studies, this project explores gene expression during each of the three trimesters of pregnancy as a baseline study to try to pinpoint the genes that may be the cause of CHD.
The data used is a combined cohort of the three studies, which includes only cases of healthy infants, to serve as a baseline for future application to a CHD cohort. The data is further split into two datasheets: count data and metadata. The count data is a compilation of the counts of 17,005 different placental genes from each of the 125 placental samples in the three combined studies. The metadata, in contrast, accounts for the various variables explored in each of the three studies, including but not limited to gestational age, infant sex, and whether the delivery was preterm. However, when the analysis reveals a confounding variable, a new analysis plan needs to be developed to produce accurate results.
Predictive Analysis of Violent Crime Rates in the US
By: Christina Myers G'24
This study employs machine learning techniques to predict violent crime rates in US states from 2011 to 2019. The primary objective is to identify the factors contributing to these crime rates to optimize resource allocation and prevention strategies. Among the models employed, XGBoost is the most accurate. The research highlights the importance of education in crime prevention, with education levels ranking as the most influential variable. This study also reveals that the District of Columbia has the highest predicted crime rate while Vermont has the lowest. Identifying states with the lowest crime rates allows them to be used as potential models for lowering crime rates elsewhere. Ultimately, this research serves as a valuable resource for advocating and directing investments into education, paving the way for a safer and more secure future for our communities.
Absolutely Accurate
By: Shawn Smith G'24
Inaccurate case scheduling can cause major disruption to the operating room and result in significant financial loss. Throughout this project, I analyzed numerous cases and scheduling guidelines to develop different solutions that would improve the hospital's surgical case scheduling accuracy. The first problem identified was the measures used to determine the recent average for a surgeon's procedure. These measures were severely flawed and required alternative sample sizes to reflect a more accurate recent average. The second problem identified was that no guidelines had been established for the scheduling process, especially when 90% of the scheduled cases are handled through the contact center.
The contact center is off-campus, and communication with the clinical staff in the operating room is limited, so it is extremely important to develop scheduling guidelines to ensure that surgical cases get scheduled accurately. Future work will include restructuring the metrics for the recent average and establishing surgical scheduling guidelines to increase the surgical case scheduling accuracy percentage. Improving the scheduling accuracy percentage will have a significant impact on operating room prime-time utilization and reduce same-day case cancellations.
Using Machine Learning to Predict the NBA MVP
By: Zachary Williams G'24
In this project, we took raw data from two data sources, FiveThirtyEight and Basketball Reference, and created a regression-based model to predict who would win the 2022-2023 MVP, identify which statistics were best at predicting the MVP, and compare the outcome to sportsbook betting odds. I chose this project because of my love for the sport, the connection to my industry of online sportsbook betting, and the intrigue of seeing how well the data could predict the NBA MVP. The dataset was run against SVR, Random Forest, Gradient Boosting, and KNN regressors, with R-squared and mean squared error used to measure effectiveness. Overall, the results were incredibly positive: we received estimates similar to the actual MVP outcome for the 2022-2023 NBA season and created a rubric for applying the code and data to the 2023-2024 season. In the future, the plan is to clean up the project code, run the model against the new season's data, extend the work to other sports or awards, and further improve the regressors' results.
Predictive Modeling for Stroke Detection: A Comprehensive Healthcare Data Analysis
By: Brianna Tittarelli G'24
This project focuses on predicting strokes in healthcare data through a comprehensive data-driven approach. Beginning with data exploration and preprocessing, the study addressed missing values and encoded categorical variables using one-hot encoding. The dataset was split into features and target variables, followed by further division into training and testing sets. To enhance model performance, standardization and normalization techniques were applied to the features. To tackle class imbalance, Random Over-Sampling was employed. Exploratory data analysis techniques, including histograms, scatter plots, and box plots, were utilized to gain insights into the relationships between variables.
The study employed two machine learning models: Logistic Regression and Random Forest. The Logistic Regression model was trained and evaluated on validation and test sets, showcasing promising results. Subsequently, a Random Forest model was employed, further addressing class imbalance with Random Over-Sampling. Hyperparameter tuning using Grid Search improved the Random Forest model's performance. The final model was selected based on the best hyperparameters and demonstrated robust predictive capabilities. Synthetic Minority Over-sampling Technique (SMOTE) was implemented to handle class imbalance, enhancing model performance. The project provides a comprehensive framework for predictive modeling in healthcare, emphasizing the significance of data preprocessing, feature engineering, and model selection. The Random Forest model, after hyperparameter tuning, emerged as the most effective predictor of strokes. Overall, this study presents a structured approach to predictive modeling, demonstrating its applicability in healthcare data analysis.
Credit Card Fraud Detection
By: Durga Rao Rayapudi G'24
Credit cards play a major role in our day-to-day lives, as most of us use them constantly, in person or online. The usage of credit cards increased drastically with the emergence of the internet and e-commerce. Credit cards fall into the wrong hands either physically or online through a variety of scams; email phishing and data breaches are the two main mechanisms intruders use to steal card information online. Credit card fraud is a type of identity theft in which an unauthorized user makes a transaction without the cardholder's knowledge or approval. It is considered one of the biggest crimes globally, so financial institutions have been trying their best to control it, as it causes severe losses to banks and financial institutions. This loss is not limited to the banks but also extends to individuals. I was personally a fraud victim in 2016, and around 65% of people have been victims of this fraud at least once. According to the Nilson Report, this fraud caused $28.58 billion in losses to financial institutions globally in 2020.
This research aims to identify credit card fraud at an early stage by developing a machine learning model that can predict whether a given transaction is fraudulent. Because this model follows the data science paradigm, its development happened in multiple phases. In each phase, the data was studied thoroughly, since understanding the data well in the early stages helps avoid rework. I started with the data collection phase by finding an appropriate data source on the Kaggle website, a repository of datasets. The data was explored using several statistical, analytical, and data visualization techniques, and the original data was updated or trimmed using several data modification techniques in the pre-processing phase. Once the data was ready, it was split into train and test datasets. Models were trained and baseline accuracy scores were calculated using binary classification algorithms such as Random Forest, k-nearest neighbors, Decision Tree, and Logistic Regression. Each model was validated and evaluated on the test dataset and compared against the training baseline scores. The logistic regression model performed best, with an accuracy score of 98.2%.
Predicting Store Weekly Sales: A Case Study of Walmart Historical Sales Data
Stephen Boadi G'23
This capstone project aimed to develop a predictive model for weekly sales at different Walmart stores. The dataset was provided by Walmart as part of a Kaggle competition and contained various features, including store size, location, type, and economic indicators. The project used different regression models, including linear regression, random forest regression, and gradient boosting regression, to predict weekly sales.
After data exploration, preprocessing, and feature engineering, the models were trained and evaluated using the training and validation data. Hyperparameter tuning and feature importance analysis were used to improve the performance of the models. The final best model was selected based on its validation error and compared to the top score in the Kaggle competition leaderboard.
The results showed that the Random Forest regressor had the lowest validation error and was chosen as the best model, as it showed strong predictive power for the problem. The key features that influenced weekly sales were store size, store type, and department within the store. The model was used to predict weekly sales for the test data, and the best model was evaluated and compared against the Kaggle competition leaderboard.
Overall, the project demonstrated the use of data cleaning, regression modeling, hyperparameter tuning, and feature importance analysis to develop a predictive model for weekly sales. The final model showed promising results, but further improvements could be made with more data and additional feature engineering.
Predicting Burnout: A Workplace Calculator
Jill Anderson G'23
Is it possible to predict and ultimately prevent burnout? When the pandemic began, employers moved their employees to work from home where possible. More than two years later, many of these employees have not returned to the office. However, some employees, including myself, prefer to work in an office environment. I hypothesize this may be associated with burnout. HackerEarth hosted a competition that ran from October 20 to November 19, 2020, and this dataset was also used to create a burnout quiz. I took this quiz several times to see how I scored. One of my attempts resulted in a lower score when I increased one of the attributes, how busy I consider myself. Could I create a better model than the survey? Results show that feature selection and regression modeling are efficient for predicting burnout. A predictive model of this type could guide employees and employers and minimize burnout. For example, when an employee approaches a score close to the burnout level, they and their manager can have a discussion. This discussion could result in changes that lower the employee's score before they burn out.
Beyond Artist: Text-to-Image AI
Sarah Snyder G'23
Text-to-image AI is new software that turns a text prompt into stunning images. The prompt can be long or short, detailed or simple, with output in any style or medium the user can imagine. Painting, photography, sculpture, architecture, industrial design, fashion design, and more: this new software can do it all with stunning realism that is oftentimes indistinguishable from the real thing. The images are so convincing that even experts in their respective fields have been fooled when asked to distinguish real work from AI-created work.
Below is a series of six images: one of the six is real and selling for $36,000 at Sotheby's Auction House; the rest are AI-generated. Can you spot the real painting?
With most of these programs now widely accessible to the public, the art world has been disrupted like never before. What it means to be an artist has been left in free fall as the world decides what the definition of art is. Ethical outrage has erupted over the discovery that these companies are using datasets of images scraped from the internet containing the intellectual property of countless people without their consent or knowledge. Many artists face the realistic prospect of being replaced by AI, while others embrace this new technology.
Follow me as I present the magnitude of these advances, how the software works, its uses, applications, controversy, ethics, historical context, and more through a captivating one-hour presentation that takes us beyond the concept of the artist and into the uncharted territory of text-to-image AI.
Predicting the Assessment Value of a Home
Corey Clark G'23
Countless tools have been developed for predicting the sale price of a house; however, predicting the assessment value is a topic yet to be explored thoroughly. Our analysis focused on determining the most important factors for predicting the assessed value of a home, then comparing them with the factors predicting the sale price. Our dataset consisted of parcel information from the municipal office of Warwick, RI, which was exported from the Vision Government Solutions software using an interface called the Warren Group Extract. We limited our selection to residential properties with a sale price of over $10,000. Using Lasso and Random Forest regression, we weighted the importance of each feature for predicting both the sale price and the assessed value. The Lasso models were more inconsistent than the Random Forest models. On average, the predicted total assessment using Random Forest was $3,575 less than the actual value; in contrast, the average predicted sale price using Random Forest was $4,250 less than the actual value. Both predictions are within an acceptable range. For both the sale price and total assessment Random Forest models, the effective area of the house was determined to be the most critical factor. However, as expected, comparing the models revealed some differences in variable importance. In the assessed value prediction, the Random Forest model identified grade as more important than total acreage, whereas for the sale price prediction, total acreage had greater importance than grade. Leveraging this model gives municipalities an alternative way to identify the prioritized features stored in their database and helps determine the correct assessed value of their properties with more confidence.
Best In Class Regression: An Analysis of Car Prices
Daniel Duffee G'23
Across the world, cars are used as a means of transportation, and the demand for them is huge, with more than 60 automotive manufacturers operating globally. With so many different models and features available, the question arises: "What is most important in determining the price of a car?" This project set out to construct a model through regression analysis to answer that question. The study was conducted using a dataset from Kaggle.com that evaluated car prices in the American marketplace across a variety of brands. It consists of 205 entries and contains 25 features covering the car's size, performance, engine, and more. Multivariate linear regression with recursive feature elimination and random forests were used to craft an effective pricing model. Recursive feature elimination was used to reduce the original dataset to a simpler model, whose adjusted R-squared value was then examined to see how the model performed. Random forests were then used to address multicollinearity, with RMSE used to evaluate the new model's performance. It was found that the most important factors in determining the price of a car were "carwidth", "curbweight", "enginesize", "horsepower", and "highwaympg."
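A small sketch of the recursive feature elimination step might look like the following; the file name and the extra columns beyond those named above are assumptions about the Kaggle schema.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

cars = pd.read_csv("CarPrice.csv")   # assumed file name for the Kaggle data
X = cars[["carwidth", "curbweight", "enginesize", "horsepower", "highwaympg",
          "carlength", "citympg", "boreratio"]]   # candidate predictors (assumed)
y = cars["price"]

# Keep the five strongest predictors according to RFE with a linear model.
selector = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
print([col for col, keep in zip(X.columns, selector.support_) if keep])
```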
The Customer is Always Right: Leveraging Social Media for Customer Feedback
James Kleine-Kracht G'23
The saying goes that "Sales are the voice of the customer," so why not try to leverage their actual voice? In today's world, customers are constantly discussing products and services on social media. This project aims to use tweets with specific keywords and apply text mining software to discover real customer feedback. The project focuses on Disney Parks, Experiences, and Products to compare sentiment across multiple avenues. Looking at nearly 700,000 tweets across 27 days, we can compare topics such as Marvel vs. Star Wars and Disney World vs. Disneyland, as well as look at specific topics like Disney Plus or Disney Vacation Club.
Autonomous Maze Solving with a 'Create2' Robot
Kyle Kracht G'23
Machine vision can be used to guide a robot through a maze. For the robot to successfully navigate the maze, it must know what turns are available to it at any given time, how to activate its motors to move through the maze, and a strategy for what choices to make. To interpret the visual data, it needs to be processed through a convolutional neural network in real time, which necessitates a network architecture that uses very little RAM and computational power. Additionally, to successfully navigate the maze, the network must have very high accuracy, as it will have to make the correct classification many times in a row. The robot must be given explicit directions to execute its maneuvers in physical space. This is accomplished by writing Python code that sends a velocity value to each motor, a command to wait for a certain period, and then a stop command. To turn a certain number of degrees, the encoders are polled, and the motors are stopped once a certain difference in encoder counts is read. Finally, the robot requires an algorithmic way to approach its decision making. Thankfully, there exists an algorithm for solving any geometrically "simply connected" maze, i.e., one in which the internal walls all connect back to the exterior.
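A hedged sketch of this control logic is given below; the drive and read_encoders helpers are hypothetical placeholders for the Create 2 serial interface (not a real library API), and the counts-per-degree constant is illustrative.

```python
import time

COUNTS_PER_DEGREE = 2.6   # assumed encoder counts per degree of in-place rotation

def drive(left_mm_s: int, right_mm_s: int) -> None:
    """Hypothetical placeholder for the robot's per-wheel velocity command."""
    raise NotImplementedError("wire this to the robot's serial interface")

def read_encoders() -> tuple[int, int]:
    """Hypothetical placeholder for polling the left/right wheel encoder counts."""
    raise NotImplementedError("wire this to the robot's encoder packets")

def drive_forward(speed_mm_s: int, seconds: float) -> None:
    # Move, wait for the commanded period, then stop, as described in the text.
    drive(speed_mm_s, speed_mm_s)
    time.sleep(seconds)
    drive(0, 0)

def turn_degrees(degrees: float, speed_mm_s: int = 100) -> None:
    # Spin in place and poll the encoders until the count difference
    # corresponds to the requested angle, then stop the motors.
    start_left, start_right = read_encoders()
    target = abs(degrees) * COUNTS_PER_DEGREE
    direction = 1 if degrees > 0 else -1
    drive(direction * speed_mm_s, -direction * speed_mm_s)
    while True:
        left, right = read_encoders()
        if abs((left - start_left) - (right - start_right)) >= target:
            break
        time.sleep(0.01)
    drive(0, 0)

def choose_turn(open_left: bool, open_straight: bool, open_right: bool) -> float:
    # Left-hand wall-follower rule for simply connected mazes:
    # prefer left, then straight, then right; otherwise turn around.
    if open_left:
        return 90.0
    if open_straight:
        return 0.0
    if open_right:
        return -90.0
    return 180.0
```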
Text-mining a decade of conflicts in Africa
Frankline Owino G'23
Violent conflicts have beset Africa for decades, even after most countries achieved independence. Between 1997 and 2017, there were 96,717 violent conflicts, and these are only the recorded ones. The violent conflicts in Africa are primarily of three kinds: violence against civilians, riots and protests, and military battles. Between 2012 and 2017 there was a steady increase in violent conflicts; in fact, the rate of increase is likely higher than reported, since the areas where many of these conflicts took place are inaccessible. This project examines unstructured text data from ACLED and Twitter. The data is used to examine associations between violence and factors such as economic prosperity. Sentiment analysis and supervised learning methods are used to probe issues around the violence and assess whether motivations are political or resource-based. Sentiment analysis showed a significantly higher number of negative sentiments over the years; the words killed, armed, and police featured prominently, while the word peace was mentioned only 93 times in 23 years. Predictive models, such as Support Vector Machines, Random Forests, Boosting, and kNN, were built to predict fatalities and achieved at most a 54% accuracy level. A simple binary representation of fatalities outperformed all other models considered, and although performance was not outstanding, it was found to be better than random.