How intelligent systems damage the data


We have been hearing about Machine Learning being used to boost the recruitment process for a while now. At first glance, this seems like a great application of Machine Learning, or AI in general, but some questions quickly arise. The fact is, if used recklessly, intelligent systems will not only bias our perception but also wreck our data, which in turn distorts the accuracy of future predictions.

Let’s say we own a big company with a large number of job vacancies and a huge number of applicants for these positions. Since browsing through all the CVs would take a lot of time and effort, we decide to build a predictive model to short-list candidates. We train the model on our current staff’s qualifications so it can recognize the common characteristics of the people who successfully passed our interviews and joined the firm. The model then rates the applicants and outputs the 20 most suitable candidates out of the original 1000. Finally, we invite those 20 for an interview and hire the best 5.
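To make the scenario concrete, here is a minimal sketch of that screening step in Python, assuming the CVs have already been converted into numerical feature vectors. The file names, the `hired` column, and the choice of scikit-learn’s LogisticRegression are illustrative assumptions, not details from the scenario above.

```python
# A minimal sketch of the CV-screening step, under the assumptions stated above.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: features of past applicants plus a label
# saying whether they were eventually hired.
staff = pd.read_csv("past_applicants.csv")       # assumed file
X_train = staff.drop(columns=["hired"])
y_train = staff["hired"]

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Score the 1000 new applicants and keep the 20 highest-scoring ones.
applicants = pd.read_csv("new_applicants.csv")   # assumed file, same feature columns
scores = model.predict_proba(applicants)[:, 1]
shortlist = applicants.assign(score=scores).nlargest(20, "score")
```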

Everything is still fine up to this point. However, 3 months later we again have open positions and call for applicants. As the number of candidates is huge, we again enlist the help of our predictive model. We re-train the model to keep it up to date and have it do its usual job: scanning CVs. This is where the problem emerges. Our training data now contains 5 new samples, and these 5 samples are biased. They are biased because the 5 rookies were chosen from the 20 candidates that our old model favored. This is how the old model damaged the data used to train the new model, so the new model inherits whatever the old model preferred.
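Continuing the sketch above, the retraining step might look like the snippet below. Picking the 5 hires as a random sample of the shortlist is a simplification (in the story they are chosen by interview); the important line is the `concat` that folds the old model’s favorites back into the training set.

```python
# Three months later: retrain on data that now includes the 5 people hired
# from the old model's shortlist. This concat is where the old model's
# preferences quietly leak into the new training data.
new_hires = (
    shortlist.sample(5)               # simplification: the interview picks the final 5
    .drop(columns=["score"])
    .assign(hired=1)
)
staff_updated = pd.concat([staff, new_hires], ignore_index=True)

new_model = LogisticRegression(max_iter=1000)
new_model.fit(staff_updated.drop(columns=["hired"]), staff_updated["hired"])
```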

Imagine, for instance, that our old model, after careful analysis, concluded that having a degree from the USA and being white are the best indicators of a good employee, while having a degree from Asia and being black are very bad traits. The list of the 20 best candidates it outputs therefore contains only white candidates and/or graduates of US universities, while black candidates and those who studied in Asia are ignored entirely. Consequently, our 5 rookies, a subset of that list of 20, are all white and/or US graduates. Once they are added to the data used to train the new model, the traits of being white and holding a US degree are reinforced, making the new model favor these two indicators even more. Black and Asian candidates can never get through the CV-scanning phase, no matter how good they are, simply because no one with these traits was in the company when the first model was built.
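A toy simulation can make this compounding effect visible. The numbers below are entirely made up, and the single binary “favoured trait” stands in for attributes like where a candidate’s degree comes from; the only point is to show how each round of hiring from the model’s shortlist shifts the staff statistics, which the next model then learns from.

```python
# Toy feedback-loop simulation with one binary feature; all numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
p_favoured_in_staff = 0.9          # share of current staff who have the favoured trait

for round_ in range(5):
    # 1000 applicants, roughly half with the favoured trait, half without.
    applicants = rng.random(1000) < 0.5
    # The score is driven by how common the trait is among current staff,
    # plus a little noise standing in for genuine qualifications.
    scores = applicants * p_favoured_in_staff + (~applicants) * (1 - p_favoured_in_staff)
    scores = scores + rng.normal(0, 0.05, size=1000)
    shortlist = applicants[np.argsort(scores)[-20:]]   # top 20 by score
    hires = shortlist[:5]                              # 5 of them get hired
    # The new hires shift the staff statistics that the next model is trained on.
    p_favoured_in_staff = 0.95 * p_favoured_in_staff + 0.05 * hires.mean()
    print(f"round {round_}: share of favoured trait among staff = {p_favoured_in_staff:.2f}")
```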

The above is an example of how intelligent systems can ruin our work if we do not pay enough attention to how they operate. To prevent something like this from happening, we need to be more careful about the input and output of our models, about how the models turn that input into output, about the evaluation metrics, and, most importantly, about the effects and side effects that the models can have.
