I am a storyteller who is passionate about data, writing, public speaking, and building community. I hold a special place in my heart for Natural Language Processing (spaCy, NLTK) and inferential modeling (statsmodels, scikit-learn, and SHAP or LIME explainers). I also have experience with data wrangling, scraping, and engineering (pandas, Selenium); machine learning algorithms (decision trees/random forests, linear and logistic regression); visualizations (seaborn, matplotlib); and maps (Folium). I have experience using APIs (Google, Yelp, Twitter) as well as public data sets (NYC Open Data portal). I am a firm believer in a well-written code comment, accessible documentation, and better coding with functions.
- Pulled the last five years of Google Trends data for the "pumpkin spice" search term
- Performed both additive and multiplicative time series decomposition of the data, with explanations
- Tested naive baselines that use last year's and last week's values as direct predictions for today
- Tested SARIMA and ETS model iterations with terms chosen from ACF and PACF plots, as well as an iterative search with pmdarima's auto_arima (a minimal sketch follows this list)
- Best model was an ETS model with a MAPE of 12.12 and an RMSE of 4.43, roughly two to three times better than using last year's data
- Exported the final model and will revisit it in one year to validate against data it has never seen
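A minimal sketch of the decomposition-and-ETS step, using a synthetic weekly series in place of the real Google Trends export (the seasonality shape, holdout split, and model settings here are illustrative assumptions, not the project's exact configuration):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.seasonal import seasonal_decompose

# Stand-in for the Google Trends export: five years of weekly interest with
# a strong autumn peak (the real project read a Trends CSV instead).
weeks = pd.date_range("2018-01-07", periods=260, freq="W")
rng = np.random.default_rng(42)
interest = pd.Series(
    50 + 40 * np.cos(2 * np.pi * (weeks.dayofyear - 280) / 365.25)
    + rng.normal(0, 3, len(weeks)),
    index=weeks,
)

# Additive vs. multiplicative decomposition with 52-week seasonality.
additive = seasonal_decompose(interest, model="additive", period=52)
multiplicative = seasonal_decompose(interest, model="multiplicative", period=52)

# Hold out the final year, fit a Holt-Winters / ETS candidate, and score it.
train, test = interest[:-52], interest[-52:]
ets = ExponentialSmoothing(train, trend="add", seasonal="add", seasonal_periods=52).fit()
forecast = ets.forecast(52)
print("MAPE:", 100 * mean_absolute_percentage_error(test, forecast))
```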
Natural Language Processing project using sentiment analysis to help find Hurricane Ian survivors in distress and provide them with links to the National Disaster Distress Helpline and FEMA information.
- Scraped Twitter for over seven thousand original tweets
- Used three different sentiment analyzers (TextBlob, VADER, and a DistilBERT model) to score the sentiment of tweets (a minimal sketch follows this list)
- Used the analysis to explore the most common words in negative and positive tweets
- Used Random Forest, Naive Bayes, XGBoost, and CatBoost on vectorized tweets, reaching 80% precision with the best-tuned model
- Used LIME's text explainer for inferential understanding of the most common words in the data, and identified the words most likely to appear in negative tweets
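A minimal sketch of the rule-based scoring step, covering TextBlob and VADER only (the example tweets and the -0.5 distress threshold are illustrative assumptions, and the DistilBERT pass is omitted):

```python
import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)

# Toy tweets standing in for the ~7,000 scraped originals.
tweets = pd.DataFrame({"text": [
    "We lost everything, please send help",
    "So grateful the storm passed us by",
]})

vader = SentimentIntensityAnalyzer()
tweets["textblob_polarity"] = tweets["text"].apply(lambda t: TextBlob(t).sentiment.polarity)
tweets["vader_compound"] = tweets["text"].apply(lambda t: vader.polarity_scores(t)["compound"])

# Flag strongly negative tweets as candidates for helpline / FEMA outreach links.
tweets["flag_distress"] = tweets["vader_compound"] < -0.5
print(tweets)
```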
Health information project for early detection and faster diagnosis of pathological fetal heartbeats.
- Built data pipelines and created two early detection algorithms ready for A/B testing
- Tested Logistic Regression, K-Nearest Neighbors, Support Vector Machine, and Random Forest models, reaching 92% recall on the pathological class
- Used a OneVsRest wrapper and GridSearchCV to tune a model specialized for the pathological class (a minimal sketch follows this list)
- Used SHAP values for inferential understanding of the model, and removed five input features while keeping recall at 90%
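A minimal sketch of the recall-focused tuning, using synthetic data in place of the fetal heartbeat features (the parameter grid and the label standing in for the pathological class are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Stand-in for the cardiotocography features; class 2 plays the role of the
# rare pathological class here.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Scorer that only measures recall on the "pathological" label, so the grid
# search tunes for catching that class rather than overall accuracy.
pathological_recall = make_scorer(recall_score, labels=[2], average="macro")

grid = GridSearchCV(
    OneVsRestClassifier(RandomForestClassifier(random_state=42)),
    param_grid={"estimator__n_estimators": [100, 300],
                "estimator__max_depth": [None, 10]},
    scoring=pathological_recall,
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
print("pathological recall:",
      recall_score(y_test, grid.predict(X_test), labels=[2], average="macro"))
```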
Built an inferential linear regression model to inform the city of Seattle on where to build new public housing.
- Performed in-depth EDA, determining the effect of renovations over time on property values for efficient budgeting.
- Used Google's Geocoding API and Selenium to engineer features measuring how close each property is to the nearest public school, hospital, and police station, for infrastructure insights related to property values
- Visualized multicollinearity with seaborn correlation heat maps for feature selection
- Iteratively constructed eight versions of the linear regression model, removing statistically insignificant features, for an R-squared score of 83% (one pruning pass is sketched below)
- Produced an automated findings report for human analysis of coefficients, including inferential analysis of inequity by zip code
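A minimal sketch of one p-value-driven pruning pass, using synthetic data in place of the Seattle property records (every column name and coefficient here is an illustrative assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Stand-in for the cleaned property data with engineered distance features.
rng = np.random.default_rng(0)
n = 300
housing = pd.DataFrame({
    "sqft_living": rng.normal(2000, 500, n),
    "dist_to_school_km": rng.normal(1.5, 0.5, n),
    "dist_to_hospital_km": rng.normal(3.0, 1.0, n),
    "was_renovated": rng.integers(0, 2, n),
})
housing["price"] = (150 * housing["sqft_living"]
                    + 25_000 * housing["was_renovated"]
                    + rng.normal(0, 30_000, n))

# One iteration of the pruning loop: fit, inspect p-values, drop the
# insignificant features, and refit on the survivors.
features = list(housing.columns.drop("price"))
model = sm.OLS(housing["price"], sm.add_constant(housing[features])).fit()
pvals = model.pvalues.drop("const")
survivors = pvals[pvals <= 0.05].index.tolist()
refit = sm.OLS(housing["price"], sm.add_constant(housing[survivors])).fit()
print("kept:", survivors, "R-squared:", round(refit.rsquared, 3))
```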
Advised hypothetical Microsoft Studios on what kinds of feature films to produce by analyzing box office financial data.
- Aggregated data using pandas and visualized in seaborn to discover the most popular genres by net profit: Animation, Adventure, Fantasy, and Family films
- Pivoted the table to aggregate films in the most popular genres by release month to infer seasonal trends: the best months for releasing new films in those genres are June, July, May, March, November, and April (the pivot is sketched after this list)
- Inferred that the best run time for new films is 120-150 minutes, based on historical popularity
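A minimal sketch of the genre-by-release-month pivot, with a toy DataFrame standing in for the box office data (all column names and figures are illustrative assumptions):

```python
import pandas as pd

# Toy rows standing in for the cleaned box office financials.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Animation", "Adventure", "Animation", "Fantasy"],
    "release_date": pd.to_datetime(["2018-06-15", "2019-07-04", "2017-11-22", "2019-03-08"]),
    "worldwide_gross": [500e6, 800e6, 650e6, 300e6],
    "production_budget": [150e6, 200e6, 175e6, 120e6],
})

movies["net_profit"] = movies["worldwide_gross"] - movies["production_budget"]
movies["release_month"] = movies["release_date"].dt.month_name()

# Median net profit for each genre/month combination to surface seasonal trends.
seasonal = movies.pivot_table(
    index="genre", columns="release_month", values="net_profit", aggfunc="median"
)
print(seasonal)
```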
- Used pandas to clean the 2021 NYPD Stop-and-Frisk data for later use in a Tableau Public viz (a cleaning sketch follows this list)
- Resulting visualization used in the article "EDA with Tableau: 2021 NYPD Stop-and-Frisks by Demographics," published by Towards Data Science
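A minimal sketch of the cleaning pass, with a few toy rows standing in for the raw NYPD export (the column names and recoding rules shown are illustrative assumptions):

```python
import pandas as pd

# Toy rows shaped like the raw 2021 stop-and-frisk export.
raw = pd.DataFrame({
    "STOP_FRISK_DATE": ["2021-01-03", "2021-01-05", None],
    "SUSPECT_RACE_DESCRIPTION": ["BLACK", "WHITE HISPANIC", "ASIAN / PACIFIC ISLANDER"],
    "FRISKED_FLAG": ["Y", "N", "Y"],
    "SEARCHED_FLAG": ["N", "N", "Y"],
})

# Normalize column names, recode Y/N flags to booleans, parse dates, and
# drop rows missing the fields the Tableau viz relies on.
clean = raw.rename(columns=str.lower)
flag_cols = [c for c in clean.columns if c.endswith("_flag")]
clean[flag_cols] = clean[flag_cols].eq("Y")
clean["stop_frisk_date"] = pd.to_datetime(clean["stop_frisk_date"])
clean = clean.dropna(subset=["stop_frisk_date", "suspect_race_description"])
clean.to_csv("sqf_2021_clean.csv", index=False)
```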
In-depth analysis and static geographic visualization of 2020 consent-asked rates, for public transparency.
- Visualized frequencies of stops and consent requests in 2020 NYPD stop-and-frisk data toward understanding inequities in policing
- Investigated anomalies, outliers, patterns, and relationships for insights into different zip codes, and communicated results with choropleth maps and graphs made in Folium and seaborn (a choropleth sketch follows this list)
- Resulting article "Using Folium On Police Data," published by Towards Data Science
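A minimal sketch of the Folium choropleth, with toy square polygons standing in for real NYC zip-code boundaries (the GeoJSON property name, zip codes, and rates are illustrative assumptions):

```python
import folium
import pandas as pd


def square(zip_code, lon, lat, size=0.01):
    """Toy square polygon standing in for a real zip-code boundary."""
    return {
        "type": "Feature",
        "properties": {"postalCode": zip_code},
        "geometry": {"type": "Polygon", "coordinates": [[
            [lon, lat], [lon + size, lat], [lon + size, lat + size],
            [lon, lat + size], [lon, lat],
        ]]},
    }


zip_geojson = {"type": "FeatureCollection",
               "features": [square("10001", -74.00, 40.74), square("10002", -73.99, 40.72)]}
rates = pd.DataFrame({"zip_code": ["10001", "10002"],
                      "consent_asked_rate": [0.12, 0.31]})

# Shade each zip-code polygon by its consent-asked rate.
nyc_map = folium.Map(location=[40.7128, -74.0060], zoom_start=11)
folium.Choropleth(
    geo_data=zip_geojson,
    data=rates,
    columns=["zip_code", "consent_asked_rate"],
    key_on="feature.properties.postalCode",
    fill_color="YlOrRd",
    legend_name="Consent-asked rate (2020)",
).add_to(nyc_map)
nyc_map.save("consent_asked_2020.html")
```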