Understanding Underfitting and Overfitting in Machine Learning
Definitions:
- Fit: The ability of a model to accurately describe or predict outcomes based on the given data.
- Underfitting: Occurs when a model fails to capture the underlying pattern of the data, typically leading to poor performance on both training and new data.
- Overfitting: Happens when a model captures noise instead of the underlying data pattern, performing well on training data but poorly on new, unseen data.
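The three regimes defined above can be seen on a toy problem. The sketch below is an illustration only: it swaps in a k-nearest-neighbors regressor on synthetic noisy sine data (both are assumptions, not part of the original notes) because its flexibility is controlled by a single number. With k=1 the model memorizes the training set, and with k equal to the training-set size it predicts one constant everywhere.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN predictor on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    """Noisy samples of sin(x) on [0, 2*pi] (synthetic, for illustration only)."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

rng = random.Random(0)
train = make_data(60, rng)
valid = make_data(60, rng)

results = {}
for k in (1, 7, 60):  # very flexible, moderate, very rigid
    results[k] = (mse(train, train, k), mse(train, valid, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  validation MSE={results[k][1]:.3f}")
```

With k=1 the training error is zero while the validation error is not (overfitting). With k=60 both errors are high (underfitting). The intermediate setting should land between the two extremes on training error and do better than the constant predictor on validation data.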
Example Using XGBoost Model:
- Models can exhibit underfitting, an ideal fit, or overfitting, depending on how well they generalize from the training data to unseen data.
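- A dataset is usually only a subset of the whole: our data may be the entire population, but more often it is a sample of it. Overfitting means fitting the quirks of that sample rather than the pattern shared by the population.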
Monitoring and Managing Fit in Models
Real-time Data Monitoring:
- Datasets in databases are often updated in real time, so analyze a fixed snapshot and avoid altering the original data unless absolutely necessary.
- Selecting a subset that represents the entire dataset well is crucial for training machine learning models effectively.
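As a rough sketch of both points, the example below freezes a copy of a table before analysis and then draws a stratified subset so that class proportions are preserved. The table contents, the `label` key, and the helper name `stratified_sample` are hypothetical, and stratifying by label is only one possible way to make a subset representative.

```python
import copy
import random
from collections import defaultdict

def stratified_sample(rows, label_key, fraction, seed=0):
    """Sample `fraction` of the rows from each class so class shares are kept."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for group in by_label.values():
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample

# Hypothetical "live" table: 90 rows of class 0 and 10 rows of class 1.
live_table = [{"id": i, "label": 0 if i < 90 else 1} for i in range(100)]

# Work on a frozen copy so later updates to the live table cannot change the analysis.
snapshot = copy.deepcopy(live_table)
live_table.append({"id": 100, "label": 1})  # simulated real-time update

subset = stratified_sample(snapshot, "label", fraction=0.2)
counts = {label: sum(r["label"] == label for r in subset) for label in (0, 1)}
print(len(snapshot), counts)
```

The snapshot still has 100 rows after the live table grows, and the 20% subset keeps the 9:1 class ratio of the snapshot.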
Identifying Fit Issues:
By examining the error on data not used in training (a validation dataset), you can determine whether a model is underfitting or overfitting.
If both the training error and the validation error stay high, the model is underfitting. If, from a certain point, the validation error begins to increase while the training error keeps falling, the model is overfitting.
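The criterion above can be turned into a small heuristic over per-epoch error curves. This is a minimal sketch: the function name `diagnose_fit` and the thresholds `high_error` and `tolerance` are illustrative assumptions, not standard values, and would need tuning for a real metric.

```python
def diagnose_fit(train_errors, val_errors, high_error=0.3, tolerance=1e-3):
    """Label a training run from its per-epoch error curves (rough heuristic)."""
    # Epoch at which the validation error was lowest.
    best_epoch = min(range(len(val_errors)), key=val_errors.__getitem__)
    val_rose = val_errors[-1] > val_errors[best_epoch] + tolerance
    train_kept_falling = train_errors[-1] < train_errors[best_epoch]
    if val_rose and train_kept_falling:
        return "overfitting", best_epoch
    if train_errors[-1] > high_error:
        return "underfitting", best_epoch
    return "reasonable fit", best_epoch

overfit_run = diagnose_fit([0.9, 0.6, 0.4, 0.3, 0.2, 0.1],
                           [0.95, 0.7, 0.5, 0.45, 0.5, 0.6])
underfit_run = diagnose_fit([0.9, 0.85, 0.8], [0.92, 0.88, 0.83])
print(overfit_run, underfit_run)
# ('overfitting', 3) ('underfitting', 2)
```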
Techniques to Control Overfitting
Regularization:
- Introducing regularization techniques helps reduce model sensitivity to noise. These include:
Early Stopping: Halting training when the validation error begins to increase, even though the training error is still improving (also applicable to structured data).
The training error continues to decrease, but at a certain point the validation error begins to increase. Early stopping uses this observation to pick the point just before overfitting sets in. Choosing that point is a trade-off: stopping too early leaves the model underfit, while stopping too late lets it overfit.
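A common way to implement this rule is with a patience counter, so that a single noisy epoch does not end training. The sketch below assumes the validation errors are already available as a list. The names `patience` and `min_delta` are conventional, but this function is a hypothetical helper rather than any particular library's API.

```python
def early_stopping_epoch(val_errors, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation errors.

    Training stops once the validation error has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, error in enumerate(val_errors):
        if error < best_error - min_delta:
            best_error, best_epoch, waited = error, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Never triggered: training ran to the end.
    return len(val_errors) - 1, best_epoch

val_curve = [0.95, 0.70, 0.50, 0.45, 0.50, 0.60, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_curve, patience=2)
print(stop_epoch, best_epoch)
# 5 3
```

In practice the model weights saved at `best_epoch` are the ones kept, not the weights at `stop_epoch`.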
Parameter Norm Penalties (L1/L2 Regularization): Penalties applied to model parameters help maintain simplicity in the model and prevent overfitting. This approach is useful in structured data analysis and results in models that are less sensitive to noise.
Figure: the blue curve shows the ideal fit with the penalty applied; the green curve shows severe overfitting without the penalty.
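To make the effect of the penalties concrete, the sketch below fits a single slope with no intercept, a deliberately tiny case chosen here because it has a closed-form solution. Minimizing sum((y - w*x)^2) + l1*|w| + l2*w^2 gives w = sign(Sxy) * max(|Sxy| - l1/2, 0) / (Sxx + l2), where Sxy and Sxx are the sums of x*y and x*x. The function name and data are illustrative assumptions.

```python
def fit_slope(xs, ys, l1=0.0, l2=0.0):
    """Slope w minimizing sum((y - w*x)^2) + l1*|w| + l2*w^2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # L1 soft-thresholds the correlation term; L2 inflates the denominator.
    shrunk = max(abs(sxy) - l1 / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / (sxx + l2)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(fit_slope(xs, ys))           # 2.0  (no penalty)
print(fit_slope(xs, ys, l2=14.0))  # 1.0  (L2 shrinks the slope)
print(fit_slope(xs, ys, l1=28.0))  # 1.0  (L1 shrinks the slope)
print(fit_slope(xs, ys, l1=56.0))  # 0.0  (a large L1 penalty zeroes it out)
```

Both penalties pull the parameter toward zero. L2 shrinks it smoothly, while L1 can set it exactly to zero, which is why L1 is often associated with sparse models.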
Data Augmentation:
- Widely used in deep learning, particularly with image data. It artificially creates additional training data through transformations such as rotations and flips to improve model robustness.
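The flip and rotation transformations mentioned above can be sketched on a tiny image stored as nested lists. Real pipelines would normally use an image library. The plain-Python helpers here are hypothetical and only show that each transformation produces a new training example with the same label.

```python
def hflip(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]

def vflip(image):
    """Mirror the rows top-to-bottom."""
    return [row[:] for row in image[::-1]]

def rot90(image):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def augment(image):
    """Return the original plus flipped and rotated copies; the label is unchanged."""
    return [image, hflip(image), vflip(image), rot90(image)]

image = [[1, 2],
         [3, 4]]
for variant in augment(image):
    print(variant)
```

Whether a given transformation is safe depends on the task: a horizontal flip is harmless for most object photos but would change the meaning of a digit or a road sign.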
Synthetic Minority Over-sampling Technique (SMOTE):
- Used in structured data analysis for addressing class imbalance by generating synthetic examples near existing minority class data points.
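The core idea, generating a synthetic example on the line segment between a minority point and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration and not the full SMOTE algorithm or any library's implementation. The function name `smote_like` and the sample points are assumptions, and it expects at least two minority points.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward near minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base`, excluding the point itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbor = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
for point in new_points:
    print(point)
```

Because every synthetic point is an interpolation between two real minority points, it stays inside the region the minority class already occupies.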
Dropout:
- Prevents co-adaptation of features by randomly omitting subsets of features, or by otherwise reducing the complexity of the model, during training.
- Nodes are randomly disconnected so that only part of the features is used at each step.
- Use dropout in deep learning and pruning techniques with structured data.
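A minimal sketch of the random-disconnection idea is shown below, using the common "inverted dropout" formulation in which surviving activations are rescaled during training so that nothing needs to change at inference time. The function signature is a hypothetical one written for this example.

```python
import random

def dropout(activations, p_drop, training, rng):
    """Inverted dropout: zero each unit with probability p_drop while training."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]
train_out = dropout(hidden, p_drop=0.5, training=True, rng=rng)
eval_out = dropout(hidden, p_drop=0.5, training=False, rng=rng)
print(train_out)
print(eval_out)
```

During training each unit is either zeroed or doubled (for `p_drop=0.5`). At evaluation time the activations pass through unchanged.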
Column Sampling by Tree: Each individual model (tree) in the ensemble is built from a random sample of the columns.
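XGBoost exposes this idea through its `colsample_bytree` parameter. The sketch below does not call XGBoost. It only imitates the sampling step, with a hypothetical helper that picks which column indices each tree would be allowed to see.

```python
import random

def sample_columns_per_tree(n_columns, n_trees, colsample=0.5, seed=0):
    """For each tree, pick the column indices that tree is allowed to use."""
    rng = random.Random(seed)
    n_keep = max(1, int(n_columns * colsample))
    return [sorted(rng.sample(range(n_columns), n_keep)) for _ in range(n_trees)]

column_sets = sample_columns_per_tree(n_columns=10, n_trees=3, colsample=0.5)
for tree_index, columns in enumerate(column_sets):
    print(f"tree {tree_index}: columns {columns}")
```

Because each tree sees a different subset of features, no single column can dominate every tree, which plays a role similar to dropout in neural networks.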
Other techniques: batch normalization, noise robustness, and label smoothing.
The techniques highlighted in color in the original slides can also be used with structured data.
Ensuring Model Robustness and Generalization
Expanding Data and Feature Sets:
- Increasing the quantity and diversity of data and incorporating more features can help in reducing underfitting.
Using More Complex Models:
- Employing models with higher capacity can sometimes capture more complex patterns in the data, reducing underfitting but requiring careful handling to avoid overfitting.
Validation and Reproducibility Strategies
Validation Strategies:
- Techniques such as hold-out validation and cross-validation check that the model performs well across different subsets of the data and reduce the risk of a skewed (overly optimistic or pessimistic) evaluation.
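Both strategies come down to how sample indices are partitioned. The sketch below shows one possible index-level implementation. The function names are hypothetical, and real projects would usually rely on a library's splitters instead.

```python
import random

def holdout_split(n_samples, val_fraction=0.2, seed=0):
    """Shuffle indices once and reserve the last val_fraction for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[:n_samples - n_val], indices[n_samples - n_val:]

def kfold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices); each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

train_idx, val_idx = holdout_split(10)
print(len(train_idx), len(val_idx))
for train, val in kfold_splits(10, k=5):
    print(sorted(val))
```

Hold-out validation evaluates on one fixed split, so the result depends on which samples happen to land in it. Averaging the score over the k folds uses every sample for validation once and gives a steadier estimate at k times the training cost.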
Reproducibility:
- Ensuring reproducibility through practices like setting random seeds helps in maintaining consistency across experiments and model retraining sessions.
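As a small illustration of seeding, the stand-in "experiment" below shuffles data and draws an initial weight from a seeded generator. The function is hypothetical. In a real project every source of randomness in use (for example the data loader and the modeling library) would need its own seed as well.

```python
import random

def run_experiment(seed):
    """Stand-in for a training run: shuffle data and draw an initial weight."""
    rng = random.Random(seed)  # a local generator avoids hidden global state
    data = list(range(10))
    rng.shuffle(data)
    initial_weight = rng.gauss(0.0, 1.0)
    return data, initial_weight

first = run_experiment(seed=42)
second = run_experiment(seed=42)
other = run_experiment(seed=7)
print(first == second, first == other)
```

Runs with the same seed produce identical shuffles and initial values, so differences between experiments can be attributed to the change under test instead of to chance.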
By implementing these strategies, one can effectively manage the trade-offs between bias and variance, enhancing the predictive performance and general applicability of machine learning models.
reference:
AI Engineer Fundamentals: Boostcams AI Tech Preparation Course [Part 2]