COMP 3400 Data Preparation Techniques
This course is required for the Data-centric Computing Stream.
Lab: In addition to classes, this course has one structured laboratory session per week.
Prerequisites: COMP 2001, and Statistics 2500 or Statistics 2550
Availability: This course is usually offered once per year, in Fall or Winter.
Course Objectives
This course gives students a basic knowledge of how to pre-process raw data. It enables students to pre-process small and large datasets, evaluate the effect of pre-processing techniques using data mining and machine learning methods, and scale up the pre-processing of large datasets using distributed frameworks.
Representative Workload
- Assignments (3) 40%
- Quizzes (6) 30%
- Final Exam 30%
Representative Course Outline
- IPython, Jupyter notebooks, and NumPy basics
- Pandas (mapping, sorting and ranking, and descriptive statistics), Matplotlib
- Data cleaning
  - Reasons to clean data.
  - Identifying values to clean, fixing formatting, and finding outliers and duplicates.
- Data scaling, normalization, and discretization
  - Min-max scaler, standard scaler, max abs scaler, robust scaler, quantile transformer scaler, power transformer scaler, unit vector scaler.
  - Range, clipping, log and z-score normalization.
  - Equal-width discretization and equal-frequency discretization.
  - Binning histogram and correlation analysis for data discretization.
- Scikit-learn basics, Supervised Learning (Bayesian, k-nearest neighbors, decision trees, linear models)
  - Basics of the scikit-learn package: how to prepare your data and how to load and execute models.
  - Using basic models such as Bayesian, k-nearest neighbors, decision trees, and linear models.
  - How cleaning, scaling, normalization, and discretization affect supervised learning.
- Scikit-learn, Unsupervised Learning (k-means and DBSCAN)
  - How cleaning, scaling, normalization, and discretization affect unsupervised learning.
- Scikit-learn, Dimensionality reduction (PCA and t-SNE)
  - How cleaning, scaling, normalization, and discretization affect dimensionality reduction.
- Scikit-learn, Feature selection
  - Statistics for filter-based feature selection, correlation statistics, selection methods, and variable transformations.
- Data integration and encodings
  - A data integration primer. How to combine data sets with join, merge and concatenation.
  - One-hot encoding.
- MapReduce
  - Scaling up data analysis with the MapReduce framework. Apache Spark basics and examples.
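As an illustration of the scaling, normalization, and discretization topics in the outline above, the following is a minimal sketch in plain Python. It is not course material: the function names and the `ages` sample are hypothetical, and in practice these steps would more likely be done with the scikit-learn and pandas tools the outline lists.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical sample with one outlier (90).
ages = [18, 20, 22, 25, 30, 90]
```

On this sample, `equal_width_bins(ages, 3)` places the outlier alone in the last bin and every other value in the first, while `equal_frequency_bins(ages, 3)` puts two values in each bin. Note that the rank-based approach in this sketch may split tied values across neighbouring bins.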
Page last updated March 18th 2022