This repository contains a collection of data mining and machine learning algorithms implemented in Python on the Anaconda platform. There is also a full section on machine learning with Apache Spark, which scales these techniques up to big data analyzed on a computing cluster.
The following techniques, used by data scientists in the tech industry, are covered:
- Regression analysis
- K-Means Clustering
- Principal Component Analysis
- Train/Test and Cross-Validation
- Bayesian Methods
- Decision Trees and Random Forests
- Multivariate Regression
- Multi-Level Models
- Support Vector Machines
- Reinforcement Learning
- Collaborative Filtering
- K-Nearest Neighbors
- Bias/Variance Tradeoff
- Ensemble Learning
- Term Frequency / Inverse Document Frequency
- Experimental Design and A/B Tests
In order to practice these techniques I've built the following projects:
- Movie recommendation system using actual user rating data
- Search engine for Wikipedia data
- Spam classifier
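To give a feel for how Term Frequency / Inverse Document Frequency can drive a search engine like the one listed above, here is a minimal sketch in plain Python. It is not the code from this repository: the toy corpus, the whitespace tokenizer, and the helper names `tokenize`, `tf_idf_scores` and `search` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only (not Wikipedia data).
docs = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
    "doc3": "stock markets fell sharply today",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf_scores(docs):
    """Return {doc_id: {term: tf-idf weight}} for a dict of raw texts."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            # tf = relative frequency in the document, idf = log(N / df).
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


def search(query, scores):
    """Rank documents by the summed tf-idf weight of the query terms."""
    terms = tokenize(query)
    ranked = [
        (sum(weights.get(term, 0.0) for term in terms), doc_id)
        for doc_id, weights in scores.items()
    ]
    # Highest score first; documents with no matching term are dropped.
    return [doc_id for score, doc_id in sorted(ranked, reverse=True) if score > 0]


scores = tf_idf_scores(docs)
print(search("dog", scores))
print(search("cat", scores))
```

Terms that appear in many documents get a small idf and contribute little to the ranking, while rare terms dominate it. A real search engine would likely add proper tokenization, stop-word handling and an inverted index on top of this.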
This repository also serves as a tutorial for software programmers who need to learn the Python programming language from scratch. It covers the following basics:
- Mean, median, mode and introducing numpy, scipy and matplotlib
- Standard deviation, population and sample variance
- Data distributions
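The basic statistics listed above can be sketched in a few lines. To stay dependency-free, this example uses Python's standard `statistics` module rather than numpy or scipy; the comments name the numpy/scipy calls that compute the same quantities. The `incomes` data is made up for illustration.

```python
import statistics

# Hypothetical sample data, chosen only for illustration.
incomes = [27000, 31000, 31000, 45000, 52000, 120000]

mean = statistics.mean(incomes)      # numpy equivalent: np.mean(incomes)
median = statistics.median(incomes)  # numpy equivalent: np.median(incomes)
mode = statistics.mode(incomes)      # scipy equivalent: scipy.stats.mode(incomes)

# Population variance divides the summed squared deviations by N,
# sample variance divides by N - 1, so the sample value is always larger.
pop_var = statistics.pvariance(incomes)    # numpy: np.var(incomes)
sample_var = statistics.variance(incomes)  # numpy: np.var(incomes, ddof=1)

# Standard deviation is the square root of the matching variance.
pop_std = statistics.pstdev(incomes)       # numpy: np.std(incomes)
sample_std = statistics.stdev(incomes)     # numpy: np.std(incomes, ddof=1)

print("mean:", mean, "median:", median, "mode:", mode)
print("population variance:", pop_var, "sample variance:", sample_var)
print("population std:", pop_std, "sample std:", sample_std)
```

Note how the single large value pulls the mean well above the median, which is why the median is often the better summary for skewed data such as incomes.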