In a previous post, I mentioned that I had used a regression model to predict age where age was missing. Whilst its R-squared was higher than that of a model using the median age by title, it resulted in a worse prediction.

I've now plotted the standardized residuals (observed age minus regression age, scaled by their standard deviation) against observed age. There is obviously something going on that I don't understand, as there is a clear linear relationship: the standardized residual increases as age increases. We are under-predicting the age of older people (those over, say, 30) and over-predicting the age of younger people.

Not sure what's going on here.
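
One thing that might be worth checking, offered as a sketch rather than a diagnosis of my data: for an ordinary least squares fit with an intercept, the residuals are always positively correlated with the observed values (the correlation works out to the square root of 1 minus R-squared), while they are uncorrelated with the fitted values. So an upward trend in a residuals-versus-observed-age plot can appear even when the model is fitted correctly. The Python example below uses made-up data, not the data from this post; the names `predictor`, `age` and `standardized_residuals` are purely illustrative.

```python
import numpy as np


def standardized_residuals(observed, predicted):
    """Return (observed - predicted) divided by the residuals' sample standard deviation."""
    residuals = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return residuals / residuals.std(ddof=1)


# Synthetic data (an assumption for illustration only, not the data from this post):
# one numeric predictor that explains part, but not all, of the variation in age.
rng = np.random.default_rng(0)
n = 500
predictor = rng.normal(size=n)
age = 30 + 6 * predictor + rng.normal(scale=10, size=n)

# Simple least squares fit of age on the predictor, with an intercept.
slope, intercept = np.polyfit(predictor, age, deg=1)
predicted_age = intercept + slope * predictor

z = standardized_residuals(age, predicted_age)

# R-squared of the fit.
ss_residual = np.sum((age - predicted_age) ** 2)
ss_total = np.sum((age - age.mean()) ** 2)
r_squared = 1 - ss_residual / ss_total

# Correlation of the standardized residuals with observed and with predicted age.
corr_with_observed = np.corrcoef(age, z)[0, 1]
corr_with_predicted = np.corrcoef(predicted_age, z)[0, 1]

print(f"R-squared:                       {r_squared:.3f}")
print(f"corr(residuals, observed age):   {corr_with_observed:.3f}")
print(f"sqrt(1 - R-squared):             {np.sqrt(1 - r_squared):.3f}")
print(f"corr(residuals, predicted age):  {corr_with_predicted:.3f}")
```

If the same pattern holds for the real model, plotting the residuals against predicted age rather than observed age would be a fairer check for systematic bias. It would not by itself explain why the regression-imputed ages gave a worse prediction than the median by title.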

