COMP 3400 Data Preparation Techniques

This course is required for the Data-centric Computing Stream.

Lab: In addition to classes, this course has one structured laboratory session per week.

Prerequisites: COMP 2001, and Statistics 2500 or Statistics 2550

Availability: This course is usually offered once per year, in Fall or Winter.

Course Objectives

This course gives students basic knowledge of how to pre-process raw data. It enables students to pre-process small and large datasets, evaluate the effect of pre-processing techniques using data mining and machine learning methods, and scale up the pre-processing of large datasets using distributed frameworks.

Representative Workload
  • Assignments (3) 40%
  • Quizzes (6) 30%
  • Final Exam 30%
Representative Course Outline
  • IPython, Jupyter notebooks, and NumPy basics.
  • Pandas (mapping, sorting and ranking, and descriptive statistics), Matplotlib
  • Data cleaning
    • Reasons to clean data.
    • Identifying values that need cleaning, fixing formatting, and finding outliers and duplicates.
  • Data scaling, normalization, and discretization
    • Min-max scaler, standard scaler, max abs scaler, robust scaler, quantile transformer scaler, power transformer scaler, unit vector scaler.
    • Range, clipping, log and z-score normalization.
    • Equal width discretization and equal-frequency discretization.
    • Binning histogram and correlation analysis for data discretization.
  • Scikit-learn basics, Supervised Learning (Bayesian, k-Nearest neighbors, Decision trees, Linear models)
    • Basics of the scikit-learn package, how to prepare your data, load and execute models.
    • Using basic models such as Bayesian, k-Nearest neighbors, Decision trees, Linear models.
    • How cleaning, scaling, normalization and discretization affect supervised learning.
  • Scikit-learn, Unsupervised Learning (K-means and DBSCAN)
    • How cleaning, scaling, normalization and discretization affect unsupervised learning.
  • Scikit-learn, Dimensionality reduction (PCA and t-SNE)
    • How cleaning, scaling, normalization and discretization affect dimensionality reduction.
  • Scikit-learn, Feature selection
    • Statistics for filter-based feature selection, correlation statistics, selection methods, and variable transforms.
  • Data integration and encodings
    • A data integration primer. How to combine data sets with join, merge and concatenation.
    • One-hot encoding.
  • MapReduce
    • Scaling up data analysis with the MapReduce framework. Apache Spark basics and examples.
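
As an informal illustration of the scaling, normalization, and discretization topics in the outline above, the following minimal sketch implements min-max scaling, z-score normalization, and equal-width and equal-frequency discretization using only the Python standard library. It is not course material: the function names and the sample data are hypothetical and chosen only for illustration. In practice, scikit-learn provides equivalent transformers such as `MinMaxScaler` and `StandardScaler`.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def z_score(values):
    """Center values on mean 0 and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Hypothetical sample with one large outlier (illustrative data only).
sample = [1, 2, 3, 4, 100]
scaled = min_max_scale(sample)
standardized = z_score(sample)
width_bins = equal_width_bins(sample, 2)
freq_bins = equal_frequency_bins(sample, 2)
print(scaled, standardized, width_bins, freq_bins)
```

With this sample, the single outlier squeezes the other min-max scaled values close to 0 and places four of the five values in the same equal-width bin, whereas equal-frequency binning splits the values by rank. This is one small example of how a pre-processing choice can change what a downstream model sees.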

Page last updated March 18th 2022