These are the (Unofficial) Lecture Notes of the Fast.ai Machine Learning for Coders MOOC.

You can find the Official Thread here

These are the notes for Lecture 1 of 12.

Introduction

Alternatives to Local Setup: (With Fast AI Support)

Local Setup instructions:

Assuming you have your GPU and Anaconda Setup (Preferably CUDA ≥9):

$ git clone https://github.com/fastai/fastai
$ cd fastai
$ conda env update

$ wget -O - files.fast.ai/setup/paperspace | bash

Approach to Learning

Teaching Approach:

“Hey I Just Learned this Concept, and I’ll share about it”

Good Technical Blogs:

Imports

%load_ext autoreload
%autoreload 2

Normally, if you modify the source code of an imported module, you have to restart the kernel for the change to take effect.

These two lines make the notebook reload modules automatically in case you change their source.

%matplotlib inline

This renders plots inline in the notebook.

from fastai.imports import *

Data Science is not Software Engineering. Prototyping models requires working interactively.

import * makes everything available up front, so we don't have to work out exactly which names we need.

Jupyter Tricks

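fn_name

Evaluating just the name shows what the object is and where it comes from.

?fn_name

Shows the documentation (signature and docstring).

??fn_name

Shows the source code.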

Getting the Data

Kaggle: real-world problems posted by a company or institute. They are very close to what you would face in practice, and they let you check yourself against other competitors.

TL;DR: Perfect place to check your skillset.

Jeremy: “I’ve learnt more from Kaggle competitions than anything else I’ve done in my Entire Life”


Note: Prefer techniques that will also work for downloading data to your cloud compute instance. Crestle and Paperspace will have most of the datasets pre-downloaded.

Good Practice: Create a data folder for all of your data.

To run Bash commands in Jupyter:

!BASH_COMMAND

To use a Python expression inside a Bash command, wrap it in curly braces:

!BASH_COMMAND {python_expression}

Blue Book for Bulldozers:

Goal:

The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.

Fast Iron is creating a “blue book for bulldozers” for customers to value what their heavy equipment fleet is worth at auction.

!head data/bulldozers/Train.csv

This prints the first few lines of the file.

Structured Data:

(Unofficial definition) Data arranged in columns, where the columns hold varying types of data.

PATH = "data/bulldozers/"  # the folder used in the !head command above
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])

Python 3.6 Formatting:

var = 'abc'
f'ABC {var}'

This lets Python evaluate the code inside the {} and insert the result into the string, here giving 'ABC abc'.
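As a small illustration of the f-string behaviour described above, here is a sketch. The `PATH` value is assumed from the `!head` command used in these notes, and the `summary` string is made up purely to show that any expression can go inside the braces, not just a variable name.

```python
# Folder name assumed from the !head command in these notes.
PATH = "data/bulldozers/"

# A variable inside {} is replaced by its value.
train_file = f"{PATH}Train.csv"

# Any Python expression works inside {}, not just a variable name.
summary = f"{len(PATH)} characters, upper-cased: {PATH.upper()}"

print(train_file)
print(summary)
```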

Display data:

df_raw

Simply writing this would truncate the output

display_all()

def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)

This allows the complete DataFrame to be printed.

display_all(df_raw.tail().T)

Since there are a lot of columns, we display the transpose.

Evaluation:

Since the metric is RMSLE, we work with the logarithm of the sale price from here on.

Root mean squared log error: the RMSE between the logs of the actual and predicted auction prices.
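To make the metric concrete, here is a minimal plain-Python sketch. It assumes the plain natural log (some definitions of RMSLE use log(1 + x) instead), and the prices are made up for illustration. It shows that taking logs first and then computing an ordinary RMSE gives the same number, which is why working with log prices is convenient.

```python
import math

def rmse(pred, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def rmsle(pred, actual):
    """Root mean squared log error: RMSE computed on the natural logs."""
    return rmse([math.log(p) for p in pred], [math.log(a) for a in actual])

# Made-up prices, purely for illustration.
actual = [10000.0, 50000.0, 20000.0]
pred = [11000.0, 45000.0, 20000.0]

# Taking logs up front and then using plain RMSE gives the same number.
log_actual = [math.log(a) for a in actual]
log_pred = [math.log(p) for p in pred]
print(rmsle(pred, actual), rmse(log_pred, log_actual))

# Log error depends on the ratio, not the absolute difference: being 10% too
# high on a cheap item costs the same as being 10% too high on an expensive one.
print(rmsle([1100.0], [1000.0]), rmsle([110000.0], [100000.0]))
```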

Random Forests:

TL;DR: It’s a great Start.

Curse of Dimensionality:

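The claim: the greater the number of columns, the emptier the mathematical space becomes, with the data points sitting on its edges (a mathematical property of high-dimensional spaces).

This supposedly makes the distance between points meaningless.

In practice, this worry generally turns out to be false.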
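The "points sit on the edges" property itself can be checked with a little arithmetic. The sketch below assumes points spread uniformly over a unit hypercube, and the 0.05 margin is an arbitrary choice for illustration. It only demonstrates the geometric property; it says nothing about whether models suffer from it in practice.

```python
def fraction_near_edge(n_dims, margin=0.05):
    """Fraction of a unit hypercube's volume lying within `margin` of its boundary.

    A point is away from every face only if each coordinate falls in
    [margin, 1 - margin], which has probability (1 - 2 * margin) per dimension.
    """
    return 1.0 - (1.0 - 2.0 * margin) ** n_dims

# The fraction grows quickly with the number of dimensions (columns).
for d in (1, 2, 10, 100):
    print(d, round(fraction_near_edge(d), 4))
```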

No Free Lunch Theorem:

The claim: there is no universal kind of model that works well for all kinds of datasets.

In practice, we look at data that was created by some cause or structure, rather than arbitrary random datasets. There are techniques that work well for nearly all of the general datasets we work with. Ensembles of decision trees are the most widely used of these.

ValueError: could not convert string to float: 'Conventional'

SKLearn isn’t the Best library, but it’s good for our purposes.

RandomForest:

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

Note: Regression != Linear Regression. Regression means predicting a continuous value, whatever the model.

Feature Engineering

The RandomForest Algorithm expects numerical data.

DataSet:

df_raw.saledate

Information inside a Date:

??add_datepart

To look at the source code.

This grabs the field named by the string in fldname.

Note: df.fldname would literally look up a field named fldname.

df[fldname] is the safer option in general. It does not give weird errors when we make a mistake, so don't take the lazy route of writing df.fldname.

Also, df[fldname] returns a Series.

The function loops over a list of attribute names given as strings and, for each one, looks inside the object for the attribute with that name. It is written to create any column that might be relevant to our case. (This is the exact opposite of worrying about the Curse of Dimensionality: we are creating more columns.)

There is no harm in adding more columns to your data.

Link: getattr()

Pandas groups type-specific methods under accessor attributes.

Everything datetime-specific lives under the .dt accessor of a Series (fld.dt.___).

Finally we drop the column.
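The steps above can be sketched as follows. This is a simplified stand-in for the idea, not fastai's actual add_datepart: the function name `add_date_columns`, the short attribute list and the two sample dates are all hypothetical choices made for illustration.

```python
import pandas as pd

def add_date_columns(df, fldname, attrs=("year", "month", "day", "dayofweek", "dayofyear")):
    """Expand a datetime column into one numeric column per attribute name."""
    df = df.copy()     # leave the caller's DataFrame untouched
    fld = df[fldname]  # bracket lookup: uses the string held in fldname
    for attr in attrs:
        # getattr(fld.dt, "year") is the same as writing fld.dt.year
        df[f"{fldname}_{attr}"] = getattr(fld.dt, attr)
    # The original date column is dropped once it has been expanded.
    return df.drop(columns=[fldname])

# Two made-up sale dates.
sales = pd.DataFrame({"saledate": pd.to_datetime(["2011-11-02", "2012-01-28"])})
expanded = add_date_columns(sales, "saledate")
print(expanded)
```

Passing the attribute names as strings is what makes the loop possible: adding another date feature means adding one more string to the list.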

Dealing with Strings

train_cats

Creates categorical variables for the string columns. Each such column stores integers, along with the mapping between the strings and those integers.

Make sure you use the same mapping for the training dataset and the testing dataset.

Since the decision trees will split on these columns, it is better for the categories to have a “logical” order.

An RF consists of trees that make splits. With an ordered category, the splits could be High vs. Low+Medium, followed by Low vs. Medium.
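Here is a minimal pandas sketch of the ideas above: strings stored as integer codes, an explicit logical order, and one shared mapping for the training and testing data. It uses plain pandas rather than train_cats itself, and the column name `UsageBand` and the rows are assumptions made up for illustration.

```python
import pandas as pd

# Made-up rows; UsageBand stands in for a string column with a natural order.
train = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", None, "Low"]})
test = pd.DataFrame({"UsageBand": ["Medium", "Low", "High"]})

# Convert the strings to a categorical column with an explicit, logical order.
order = ["High", "Medium", "Low"]
train["UsageBand"] = pd.Categorical(train["UsageBand"], categories=order, ordered=True)

# Reuse the training categories for the test set so both share one mapping.
test["UsageBand"] = pd.Categorical(
    test["UsageBand"],
    categories=train["UsageBand"].cat.categories,
    ordered=True,
)

# .cat.codes holds the integers the model sees; missing values become -1.
print(dict(enumerate(train["UsageBand"].cat.categories)))
print(train["UsageBand"].cat.codes.tolist())
print(test["UsageBand"].cat.codes.tolist())
```

With this order, a single split on the code (for example, code less than 1) separates High from Medium+Low, which is the kind of split described above.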

Missing Values

display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

Saving

os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')

Feather: saves the file in a format similar to the one used in RAM. In layman's terms, it's fast.

Pro-Tip: Use Temporary folder for all actions/needs that pop up while you’re working.

Final Steps

proc_df

A function in fastai.structured.

df, y, nas = proc_df(df_raw, 'SalePrice')

Running Regressor

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df, y)

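1 is the best possible score.

0 is what you get by always predicting the mean, and the score goes negative for a model that does worse than that.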
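For a regressor, scikit-learn's m.score returns R², the coefficient of determination. The following is a plain-Python sketch of the standard R² formula, not scikit-learn's own code, with made-up numbers. It shows what a score of 1 and a score of 0 correspond to, and that the score can fall below 0.

```python
def r2_score(actual, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))  # model's squared error
    ss_tot = sum((a - mean) ** 2 for a in actual)             # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up target values.
actual = [1.0, 2.0, 3.0, 4.0]

perfect = r2_score(actual, actual)                  # every prediction exact
mean_only = r2_score(actual, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r2_score(actual, [4.0, 3.0, 2.0, 1.0])      # predictions in reverse order
print(perfect, mean_only, worse)
```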

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

Checking Overfitting

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

Final Score

If you’re in the Top Half of the Kaggle LB, it’s a great start.

print_score(m)
[0.09044244804386327, 0.2508166961122146, 0.98290459302099709, 0.88765316048270615]

The four numbers are the training RMSE, validation RMSE, training score and validation score. A validation RMSE of 0.25 would get a leaderboard position in the top 25%.

Appreciation: without any thinking or intensive feature engineering, and without defining or worrying about any statistical assumptions, we get a decent score.

If you found this article to be useful and would like to stay in touch, you can find me on Twitter here.