COMP 3400 Data Preparation Techniques

This course is required for the Data-centric Computing concentration.

Lab

In addition to classes, this course has one structured laboratory session per week.

⚠︎	From Fall 2027 this course will be a required prerequisite for COMP3202 and COMP3401.

Prerequisites: COMP2001, and Statistics 2500 or Statistics 2550

Availability: This course is usually offered once per year, in Fall or Winter.

Course Objectives

This course gives students basic knowledge of how to pre-process raw data. It enables students to pre-process small and large data sets, evaluate the effect of pre-processing techniques using data mining and machine learning methods, and scale up the pre-processing of large data sets using distributed frameworks.

Representative Workload

Assignments (3) 40%
Quizzes (6) 30%
Final Exam 30%

Representative Course Outline

IPython, Jupyter notebooks, and NumPy basics.
Pandas (mapping, sorting and ranking, and descriptive statistics), Matplotlib
Data cleaning
- Reasons to clean data.
- Identifying values for cleaning, formatting, and finding outliers and duplicates.
Data scaling, normalization, and discretization
- Min-max scaler, standard scaler, max abs scaler, robust scaler, quantile transformer scaler, power transformer scaler, unit vector scaler.
- Range, clipping, log and z-score normalization.
- Equal width discretization and equal-frequency discretization.
- Histogram binning and correlation analysis for data discretization.
Scikit-learn basics, Supervised Learning (Bayesian, k-nearest neighbors, decision trees, linear models)
- Basics of the scikit-learn package: how to prepare your data, and load and execute models.
- Using basic models such as Bayesian, k-nearest neighbors, decision trees, and linear models.
- How cleaning, scaling, normalization, and discretization affect supervised learning.
Scikit-learn, Unsupervised Learning (k-means and DBSCAN)
- How cleaning, scaling, normalization, and discretization affect unsupervised learning.
Scikit-learn, Dimensionality reduction (PCA and t-SNE)
- How cleaning, scaling, normalization, and discretization affect dimensionality reduction.
Scikit-learn, Feature selection
- Statistics for filter-based feature selection: correlation statistics, selection methods, and variable transforms.
Data integration and encodings
- A data integration primer. How to combine data sets with join, merge and concatenation.
- One-hot encoding.
MapReduce
- Scaling up data analysis with the MapReduce framework. Apache Spark basics and examples.
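
As an illustration of the scaling and discretization topics listed in the outline, the following is a minimal sketch in plain Python (standard library only). It is not course material: the function names and the sample `ages` list are hypothetical, chosen only to show min-max scaling, z-score normalization, equal-width discretization, and equal-frequency discretization side by side. In practice, the NumPy, Pandas, and scikit-learn tools named in the outline would normally be used instead.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: nothing to spread out
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum sits on the right edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts.

    Bins are assigned by rank, so tied values may land in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical sample with one large value (90) to show how the methods differ.
ages = [18, 22, 25, 31, 40, 58, 90]

scaled = min_max_scale(ages)
standardized = z_score(ages)
width_bins = equal_width_bins(ages, 3)
freq_bins = equal_frequency_bins(ages, 3)

print(width_bins)  # most values crowd into the first bin
print(freq_bins)   # values are spread evenly across bins
```

With this sample, equal-width binning places five of the seven values in the first bin because the single large value stretches the range, while equal-frequency binning spreads the values across the three bins. This is the kind of effect the outline refers to when it asks how scaling and discretization affect later learning steps.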

Page last updated March 18th 2022

Computer Science
|
Faculty of Science
