Downloading and Data Extraction Pipeline

For small data, downloading can be a simple task. But in the big-data regime, data download/transfer or server migration is challenging. One needs to use command-line tools in shell scripts to transfer data from one remote server to another. Data may also be corrupted during the download, so an MD5 checksum needs to be created and compared against the source to verify the integrity of the transfer.
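The integrity check described above can be sketched in Python with the standard hashlib module. This is a minimal illustration, not a full transfer pipeline: the helper names md5_checksum and verify_download are hypothetical, and the temporary file merely stands in for a downloaded archive.

```python
import hashlib
import os
import tempfile


def md5_checksum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large downloads do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's digest matches the published checksum."""
    return md5_checksum(path) == expected_md5.strip().lower()


# Demo: a temporary file stands in for a downloaded archive (assumption).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
    demo_path = tmp.name

# Checksum as it would be published by the data provider.
expected = hashlib.md5(b"hello world").hexdigest()

intact = verify_download(demo_path, expected)     # digests match
corrupted = verify_download(demo_path, "0" * 32)  # digests differ
os.remove(demo_path)
```

Reading in chunks keeps memory use constant regardless of file size, which matters once files reach the gigabyte range.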

Lectures : Using Python to Access Web Data | Data Wrangling with MongoDB | Introduction to SQL

Tutorials: os | wget

Data Parsing and Cleaning Pipeline

Downloaded or extracted data are usually not in a clean format. For example, the data could be an XML file with information embedded in a hierarchical tree structure. Depending on the form of the extracted data, you may need to use data parsing libraries or write your own parser.
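As a rough sketch of the XML case, the snippet below flattens a small tree into a list of clean records. It uses the standard library's xml.etree.ElementTree so that it runs without extra installs; lxml provides a largely compatible interface. The catalog/book layout and the parse_books helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: stray whitespace and a missing price
# mimic the kind of mess found in real extracts.
SAMPLE_XML = """
<catalog>
  <book id="b1">
    <title>  Data Wrangling </title>
    <price>29.90</price>
  </book>
  <book id="b2">
    <title>Parsing Basics</title>
    <price></price>
  </book>
</catalog>
"""


def parse_books(xml_text):
    """Walk the tree and return one cleaned dict per <book> element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for book in root.iter("book"):
        title = (book.findtext("title") or "").strip()
        raw_price = (book.findtext("price") or "").strip()
        rows.append({
            "id": book.get("id"),
            "title": title,
            # Empty price fields become None instead of raising an error.
            "price": float(raw_price) if raw_price else None,
        })
    return rows


records = parse_books(SAMPLE_XML)
```

Each record is now a flat dictionary, which is easier to load into a table or database than the original nested tree.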

Lectures : Getting and Cleaning Data | SQL for Data Science | Using Databases with Python | SQL for Data Analyst | Algorithms on Strings

Tutorials: lxml | re

Preprocessing & Feature Engineering

Cleaned data may need some preprocessing steps, for example one-hot encoding, label encoding, data transformation, or normalization. After preprocessing, feature importance and dependency are usually checked using the underlying statistical properties of the data. Feature engineering is a key step in model design, particularly for online streaming data, where specific features are selected and passed through multiple preprocessing steps in a pipeline before being fed into model training.
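Two of the preprocessing steps named above, one-hot encoding and normalization, can be illustrated in plain Python. This is only a sketch on toy data; in practice a library such as scikit-learn would normally handle these steps. The function names and the colors/sizes columns are hypothetical.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1.
    Returns the sorted category list and the encoded rows."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded


def min_max_normalize(values):
    """Rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy dataset (assumption): one categorical and one numeric column.
colors = ["red", "green", "red", "blue"]
sizes = [10.0, 20.0, 15.0, 30.0]

categories, color_vectors = one_hot_encode(colors)
scaled_sizes = min_max_normalize(sizes)

# Chain both steps into one feature row per sample, as a pipeline would.
features = [vec + [s] for vec, s in zip(color_vectors, scaled_sizes)]
```

The final list comprehension plays the role of a pipeline: each sample passes through both transformations before it reaches model training.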

Lectures : Natural Language Processing | Applied Text Mining in Python | Feature Engineering | Feature Engineering for Improving Learning Environments

Tutorials: Data Transformation | Feature Extraction | Dimensionality Reduction

Machine Learning (Regression, Classification & Clustering)

Machine learning is the process of teaching an algorithm to work for your benefit. If you train your algorithm with known features and labels, it is called supervised learning (regression, classification, etc.). If you train with features only (no labels), it is unsupervised learning (clustering, self-organizing maps, etc.). The best features selected during feature engineering are passed through the machine learning pipeline, which builds a model through model tuning: the error is reduced and the accuracy increased while balancing the bias-variance trade-off.
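The difference between the two settings can be made concrete with a small sketch in plain Python: a least-squares line fit uses features and labels (supervised), while a one-dimensional k-means uses features only (unsupervised). Both functions and the toy data are illustrative assumptions, not a substitute for a library implementation.

```python
def fit_line(xs, ys):
    """Supervised: ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kmeans_1d(points, centers, iterations=10):
    """Unsupervised: alternate between assigning points to the nearest
    center and moving each center to the mean of its cluster."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Supervised: labels ys are known (generated here from y = 2x + 1).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Unsupervised: only the points are given; the groups are discovered.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```

The regression recovers the line that generated the labels, while k-means finds the two groups without ever being told which point belongs where.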

Tutorials: Supervised Learning | Clustering

Interactive Data Visualization

Data visualization is the way to advertise the power of data science. There are two ways to present data: as a static view (PNG, JPG, PDF, etc.) or as an interactive app on a website using JavaScript, Plotly, Tableau, etc. In interactive data visualization, viewers can interact with the visualization app and deepen their understanding of the presented material.

Lectures : Data Visualization Specialization (4 courses) | Data Visualization with Tableau Specialization (5 courses) | Data Visualization and D3.js

Tutorials: Plotly | Bokeh | Introduction to D3.js

Intermediate

Downloading and Data Extraction Pipeline

Data Parsing and Cleaning Pipeline

Preprocessing & Feature Engineering

Machine Learning (Regression, Classification & Clustering)

Interactive Data Visualization