
Artificial Intelligence Problems and Limitations – Data Overfitting

by Sanjeev Kapoor 12 Aug 2019

Artificial Intelligence (AI) is nowadays one of the most trending IT topics, as it empowers novel applications that exhibit human-like capabilities. At the same time, there is a heated debate about the pros and cons of AI when compared to Human Intelligence (HI). To date, AI is very efficient at domain-specific problems such as chess and Go, where no human can currently beat the top AI-based programs. However, AI performs poorly on problems where intelligence needs to be transferred and applied across different contexts. Humans are much more efficient than computers at generalizing their knowledge and applying it in different settings.

AI and HI can both lead to erroneous decisions, especially when these decisions are biased. Humans exhibit different forms of bias when making decisions. As a prominent example, humans tend to remember their choices as better than they actually were, which is known as choice-supportive bias. Moreover, they are sometimes overly optimistic, which leads them to overestimate pleasing outcomes as part of wishful thinking, the so-called optimism bias. Furthermore, humans have a tendency to judge decisions based on their eventual outcomes rather than on the quality of the decisions at the time they were made, which is aptly called outcome bias. These are only a few examples; in fact, human decisions are subject to many more types of bias.


Bias in AI and the Overfitting Problem

Similar to HI systems, AI systems can be biased as well. The most common types of bias for AI systems include:

  • Sample Bias, which occurs when the training datasets insufficiently describe the problem space of the AI system (see the sketch right after this list).
  • Prejudicial Bias, which occurs when the training datasets are influenced by stereotypes or prejudices present in the population.
  • Measurement Bias, which is associated with faulty measurements that inject noise into the data used by the AI system.
  • Search Bias, which is associated with the way the space of available choices is searched. Search bias can lead to the selection of sub-optimal decisions, simply because they are encountered earlier than others while searching through the space of available decisions.
  • Language Bias, which refers to limitations of the language used to express the AI model. Some languages are inappropriate for describing certain options of an AI system.
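
As a rough illustration of sample bias, the following minimal sketch (using synthetic data and scikit-learn; the features, labels and sample sizes are invented for illustration, not taken from a real system) trains the same simple classifier once on a sample drawn from only one region of the problem space and once on a representative sample. The biased model learns a rule that looks perfect in the sampled region but degrades on the full population:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Full population: the label is 1 when x1 exceeds |x0| (a V-shaped boundary).
X = rng.uniform(-3, 3, size=(4000, 2))
y = (X[:, 1] > np.abs(X[:, 0])).astype(int)

# Biased sample: only observations with x0 > 0, where the true boundary
# looks deceptively linear (x1 > x0); the rest of the problem space is unseen.
mask = X[:, 0] > 0
X_biased, y_biased = X[mask][:800], y[mask][:800]
X_repr, y_repr = X[:800], y[:800]  # representative sample of the same size

for name, X_tr, y_tr in [("biased sample", X_biased, y_biased),
                         ("representative sample", X_repr, y_repr)]:
    clf = LogisticRegression().fit(X_tr, y_tr)
    print(f"{name}: accuracy on the full population = {clf.score(X, y):.3f}")
```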

However, the most prevalent form of bias in AI systems is the so-called overfitting bias, which happens when the AI system is built to fit the available datasets very well, but is weak at reliably fitting additional data and/or predicting future observations. This usually happens when the training dataset (or parts of it) is not representative of the real-world context of the problem at hand. In such cases, the machine learning model is trained to identify patterns that exist in the training dataset but do not hold for additional data and future observations beyond it. In general, AI models that are very complex or have very high variance (i.e., flexibility) with respect to the data points of a dataset are likely to be overfitted, as the short sketch below illustrates.
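
The sketch below (a minimal example using NumPy polynomial fitting on synthetic data; the underlying function, noise level and polynomial degrees are illustrative assumptions) shows how increasing the flexibility of a model fitted to a small noisy dataset typically drives the training error towards zero while the error on unseen data from the same distribution grows:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def sample(n):
    """Noisy observations of a simple underlying relationship."""
    x = rng.uniform(0, 6, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

x_train, y_train = sample(15)   # small training set
x_test, y_test = sample(200)    # unseen data from the same distribution

for degree in (1, 3, 12):
    # A higher degree means a more flexible (higher-variance) model.
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The degree-12 polynomial passes almost exactly through the 15 training points, yet its test error is typically far worse than that of the simpler degree-3 model, which is exactly the overfitting pattern described above.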


Overcoming the Overfitting Problem

Overfitting leads to AI models with poor performance, such as low accuracy in the case of predictive analytics. Therefore, data scientists strive to avoid overfitting bias using one or more of the following measures:

  • Adding more data to the training and testing processes: As outlined, overfitting leads to the identification of patterns that hold for the training dataset but are not generally applicable to other datasets. Therefore, data scientists sometimes use more data in order to retrain and test a model. Such additional data can help identify whether the original model was overfitted or not. It also helps in deriving new models that work for the wider dataset and are less likely to suffer from overfitting bias.
  • Taking advantage of domain knowledge: A business domain expert is the most appropriate person to judge whether a pattern/rule holds in practice or whether some overfitting has happened. For example, an AI-based predictor of house prices, when trained on a given dataset, may use the “door color” as a predictor attribute of the price. A domain expert can decide whether this is plausible or a result of overfitting. That’s one of the reasons why business experts should be part of data science teams.
  • Generalized machine learning models: Detailed and complex AI models are prone to overfitting bias. Therefore, data scientists tend to opt for more general models, even though these might yield slightly lower performance on the training dataset compared to more detailed ones. Choosing generalized models is therefore a good tactic for avoiding overfitting issues.
  • Regularization: Regularization is motivated by the fact that models that overfit the data tend to be very complex (e.g., they tend to have too many and overly specific parameters). In this context, regularization introduces additional information that constrains the complexity of the model, which in turn prevents overfitting (see the sketch after this list). Note, however, that regularization comes with a performance penalty in the model building and deployment process.
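
To make the regularization point concrete, here is a minimal sketch (using scikit-learn; the polynomial degree and the penalty strength alpha are illustrative assumptions) that fits the same flexible model with and without an L2 (ridge) penalty. The penalty typically sacrifices a little training fit in exchange for better scores on held-out data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, (40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Identical flexible feature set for both models; only the L2 penalty differs.
for name, regressor in [("unregularized", LinearRegression()),
                        ("ridge, alpha=1.0", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=12), StandardScaler(), regressor)
    model.fit(X_train, y_train)
    print(f"{name}: train R^2 = {model.score(X_train, y_train):.3f}, "
          f"test R^2 = {model.score(X_test, y_test):.3f}")
```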


Overall, there are known and tested methods for alleviating overfitting bias. In practice, applying these methods is challenging, as data scientists have to deal with other related problems, such as the lack of appropriate datasets, poor data sampling and data collection processes, difficulties in understanding the business processes and social context of the problems at hand, a shortage of domain experts, as well as the widely reported talent gap in data science and AI engineering. Therefore, despite the above ways of overcoming overfitting, there are still AI systems that suffer from this problem. Nevertheless, this should not be seen as a setback to building, deploying and using AI. In the years to come, more data will gradually become available, along with more computing cycles that will allow its faster processing. More data will lead to more credible and accurate AI systems that suffer less from overfitting and other forms of AI-related bias. In the meantime, AI experts should be prepared to identify and confront bias issues in a timely manner.
