Overfitting is the excessive adaptation of a model to its training data during the inductive learning process in machine learning and artificial intelligence projects.
It can happen that the algorithm follows the data it is given too closely while it is being “trained”, and therefore proves ineffective when tested on generic data: the model produces credible predictions during the training phase but less accurate ones on “real” data, which impairs the functioning of the system and reduces the reliability of the predictions generated on new data.
How does overfitting happen?
Overfitting occurs when the machine learning model fits the training data so well that it can no longer generalize to highly variable data, such as the data it meets once it leaves the experimental setting and goes into real use.
This happens, for example, when the number of attributes to be considered is too high: the more attributes there are, the greater the risk that an irrelevant input pollutes the data and produces a compromised decision tree.
How to recognize it?
Typically, a portion of the available data is held back from training and used to analyze the behavior of the model, testing it with inputs of different kinds and values: if the error rate on this held-back data is high while the error rate on the training data is low, the model is probably overfitting.
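The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a production method: the synthetic readings, the 30-degree threshold, the 20% label noise and the function names (`make_readings`, `predict_nearest`, `error_rate`) are all assumptions made for the example. The model deliberately memorizes its training data, so it also learns the noise.

```python
import random

def make_readings(n, rng):
    """Hypothetical data: label is 1 when temperature > 30, then 20% of labels are flipped."""
    rows = []
    for _ in range(n):
        temp = rng.uniform(10.0, 50.0)
        label = 1 if temp > 30.0 else 0
        if rng.random() < 0.2:  # label noise that a good model should NOT learn
            label = 1 - label
        rows.append((temp, label))
    return rows

def predict_nearest(train, temp):
    """Memorizing model: copy the label of the closest training reading."""
    nearest = min(train, key=lambda row: abs(row[0] - temp))
    return nearest[1]

def error_rate(train, rows):
    """Fraction of rows whose label the model gets wrong."""
    wrong = sum(1 for temp, label in rows if predict_nearest(train, temp) != label)
    return wrong / len(rows)

rng = random.Random(0)
readings = make_readings(400, rng)
train, holdout = readings[:300], readings[300:]  # hold back 100 readings

train_error = error_rate(train, train)      # every reading finds itself: no errors
holdout_error = error_rate(train, holdout)  # unseen readings expose the memorized noise
print(f"training error: {train_error:.2f}")
print(f"holdout error:  {holdout_error:.2f}")
```

A training error of zero next to a clearly higher holdout error is the warning sign: the model has reproduced its training set rather than learned the underlying rule.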
Here is a practical example of overfitting. Let’s imagine a machine learning model for thermal management.
The model is required to identify changes in temperature and humidity. If many of the temperature changes occur in a common scenario, such as at night, the model may stop relating temperature to humidity data and instead learn to use the day/night alternation to classify the data. In this case, the decision tree is distorted, and its classification errors make it unusable.
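The thermal example can be reproduced with a toy model. The sketch below is only an illustration under stated assumptions: the data are invented by hand, the field names (`humidity_high`, `is_night`, `temp_change`) are hypothetical, and the "model" is a one-level decision tree that keeps whichever single attribute best matches the label on the training data.

```python
def row(humidity_high, is_night, temp_change):
    """One hypothetical observation; all three fields are 0 or 1."""
    return {"humidity_high": humidity_high, "is_night": is_night, "temp_change": temp_change}

def rule_error(rows, feature):
    """Error rate of the rule 'predict temp_change = value of feature'."""
    return sum(1 for r in rows if r[feature] != r["temp_change"]) / len(rows)

def fit_stump(rows, features):
    """One-level decision tree: keep the feature with the lowest training error."""
    return min(features, key=lambda f: rule_error(rows, f))

# Training data: by coincidence, every temperature change happened at night,
# while humidity matches the change in 18 of 20 observations.
train = (
    [row(1, 1, 1)] * 9 + [row(0, 1, 1)] * 1 +
    [row(0, 0, 0)] * 9 + [row(1, 0, 0)] * 1
)

# New data: changes follow humidity and happen by day as often as by night.
new_data = (
    [row(1, 1, 1)] * 5 + [row(1, 0, 1)] * 5 +
    [row(0, 1, 0)] * 5 + [row(0, 0, 0)] * 5
)

chosen = fit_stump(train, ["humidity_high", "is_night"])
print("attribute learned:", chosen)
print("error on training data:", rule_error(train, chosen))
print("error on new data:", rule_error(new_data, chosen))
print("error of the humidity rule on new data:", rule_error(new_data, "humidity_high"))
```

The tree prefers `is_night` because it is flawless on the training set, yet on the new data it is right only half the time, while the humidity rule it discarded makes no mistakes there.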
What are the most frequent causes and how to avoid them?
The most common causes of overfitting are:
- a training set that is too small. We need to make sure that the training set contains enough samples to represent all the variables and possible inputs (in our example: covering as many events as possible)
- a large amount of irrelevant information in the training data. Irrelevant parameters should be identified and removed (in our example: the time of day)
- training on a single sample data set only. The model focuses excessively on that particular set and therefore fails to adapt to different data (in our example: choosing the right time span, such as a full year that includes all seasons)
- a model that is too complex. In these cases, the model learns the “noise” in the training data as if it were meaningful (in our example: eliminating other irrelevant variables, such as machine downtime for maintenance)
Removing too much carries the opposite risk: without sufficient relevant data, the model falls into underfitting.
Only by training the model properly is it possible to reduce the error rate, and care must be taken to find the right compromise between the two extremes, underfitting and overfitting.
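One common way to look for that compromise is to try several levels of model flexibility and keep the one with the lowest error on held-back data. The sketch below assumes a k-nearest-neighbours classifier on invented temperature readings; the data generator, the 20% label noise and the candidate values of `k` are illustrative choices, not part of any specific product. A small `k` memorizes noise, while a `k` equal to the whole training set ignores the input entirely.

```python
import random

def make_readings(n, rng):
    """Hypothetical data: label is 1 when temperature > 30, then 20% of labels are flipped."""
    rows = []
    for _ in range(n):
        temp = rng.uniform(10.0, 50.0)
        label = 1 if temp > 30.0 else 0
        if rng.random() < 0.2:
            label = 1 - label
        rows.append((temp, label))
    return rows

def predict_knn(train, temp, k):
    """Majority vote among the k training readings closest to temp (ties go to 0)."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - temp))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if 2 * votes > k else 0

def error_rate(train, rows, k):
    """Fraction of rows misclassified by the k-nearest-neighbours vote."""
    wrong = sum(1 for temp, label in rows if predict_knn(train, temp, k) != label)
    return wrong / len(rows)

rng = random.Random(1)
readings = make_readings(1000, rng)
train, validation = readings[:600], readings[600:]

# k=1 follows every noisy reading; k=len(train) always predicts the majority class.
candidates = [1, 25, len(train)]
validation_error = {k: error_rate(train, validation, k) for k in candidates}
best_k = min(candidates, key=lambda k: validation_error[k])

for k in candidates:
    print(f"k={k:>3}: validation error {validation_error[k]:.2f}")
print("selected k:", best_k)
```

The intermediate setting is selected because it is flexible enough to find the temperature threshold but not so flexible that it reproduces the flipped labels.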
Machine learning in electrical panels
For the thermal management of electrical panels, Sensis by Fandis is the first IIoT device capable of measuring the climatic parameters inside the cabinet and regulating the heating and cooling devices accordingly. It maintains the optimal temperature level, processes information and recognizes anomalous events thanks to predictive analysis.