In the previous few videos, we filled the missing values in the whole dataset and only then split it into training and validation sets using the following code:

# Split data into training and validation
df_val = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]

The code worked, but how might filling the data before splitting it interfere with our model?

Remember the goal of machine learning: use the past to predict the future.

So if our validation set is supposed to be representative of the future and we’re filling our training data using information from the validation set, what might this mean for our model?
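To make the question concrete, here's a minimal sketch on made-up data. The column name `MachineHours`, its values, and the choice of the median as the fill value are all assumptions for illustration, not taken from the project. It compares a fill value computed on every row with one computed on the training rows only.

```python
import pandas as pd

# Toy stand-in for df_tmp. "MachineHours" and every value here are made up.
df_tmp = pd.DataFrame({
    "saleYear": [2010, 2010, 2011, 2011, 2012, 2012],
    "MachineHours": [100.0, None, 200.0, 300.0, 900.0, 1000.0],
})

# Fill value computed on ALL rows, including the 2012 (validation) rows.
median_all = df_tmp["MachineHours"].median()

# Same split as in the project code.
df_val = df_tmp[df_tmp.saleYear == 2012]
df_train = df_tmp[df_tmp.saleYear != 2012]

# Fill value computed on the training rows only.
median_train = df_train["MachineHours"].median()

print(median_all, median_train)  # 300.0 200.0

# Filling both sets with the training-only value keeps 2012 data out of training.
df_train_filled = df_train.fillna({"MachineHours": median_train})
df_val_filled = df_val.fillna({"MachineHours": median_train})
```

In this toy example the 2012 rows pull the fill value from 200 up to 300, so a training row filled with the all-rows median would carry information from the year we're trying to predict.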

The challenge here comes in two parts.

  1. What does it mean if we fill our training data with information from the future (validation set)?

  2. How might you implement a fix to the current way things are being done in the project?

If you need a hint, remember some takeaways from a previous lecture:

Keep these things in mind when we create a data preprocessing function in a few videos' time. They'll also help you answer the question that gets raised then.