The next video covers techniques for dealing with missing data and for turning categorical (non-numerical) data into numbers using Scikit-Learn.
All of the code in the video is correct; however, there is one improvement worth noting.
In a nutshell, the video shows filling and transforming the entire dataset (X). Although that code works and runs, it's best to fill and transform the training and test sets separately, so that no information from the test set influences how the training set is filled.
I've updated the code on GitHub for both notebooks to reflect this (all previous links to these notebooks will still work) and created an end-to-end Colab notebook that includes the change:
The main takeaways:
Split your data first (into train/test) and always keep your training and test data separate
Fill/transform the training and test sets separately (this goes for filling data with pandas as well)
Don't use data from the future (test set) to fill data from the past (training set)
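
The takeaways above can be sketched roughly as follows. This is a minimal illustration rather than the code from the video or the notebooks: the toy car table, its column names, and the fill strategies are all hypothetical. It shows one common way to keep the sets separate, which is to learn the fill values and categories from the training set only and then apply them to the test set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values (not the dataset from the video)
data = pd.DataFrame({
    "Make": ["Toyota", "Honda", np.nan, "BMW", "Toyota", "Honda", "BMW", np.nan],
    "Doors": [4.0, 4.0, np.nan, 5.0, 4.0, np.nan, 3.0, 4.0],
    "Price": [15000, 12000, 9000, 30000, 16000, 11000, 28000, 9500],
})
X = data.drop("Price", axis=1)
y = data["Price"]

# 1. Split first, before any filling or encoding
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Describe how each column should be filled and transformed
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
numeric_steps = SimpleImputer(strategy="mean")

preprocessor = ColumnTransformer([
    ("cat", categorical_steps, ["Make"]),
    ("num", numeric_steps, ["Doors"]),
])

# 3. Learn fill values and categories from the training set only...
X_train_ready = preprocessor.fit_transform(X_train)
# ...then apply them to the test set without refitting
X_test_ready = preprocessor.transform(X_test)

print(X_train_ready.shape, X_test_ready.shape)
```

Calling `fit_transform` only on `X_train` means the mean used to fill `Doors` is computed without looking at the test rows, and `handle_unknown="ignore"` keeps the encoder from failing if the test set contains a category the training set never saw.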
Keep these in mind when you watch the upcoming video, and remember, full working code is available in the links above.
Thank you, Robert, for pointing this out on the Q&A forums.