Understanding Underfitting and Overfitting in Machine Learning

Definitions:

Fit: The ability of a model to accurately describe or predict outcomes based on the given data.
Underfitting: Occurs when a model fails to capture the underlying pattern of the data, typically leading to poor performance on both training and new data.
Overfitting: Happens when a model captures noise instead of the underlying data pattern, performing well on training data but poorly on new, unseen data.

Example Using XGBoost Model:

Models can exhibit underfitting, an ideal fit, or overfitting, depending on how well they generalize from the training data to unseen data.
Figure: three panels showing underfitting, an ideal fit, and overfitting.

Datasets are a subset of the whole

Our data is either the whole population or, more often, a sample drawn from it.

Monitoring and Managing Fit in Models

Real-time Data Monitoring:

Datasets stored in databases are often updated in real time, so analysis should run on a fixed snapshot, and the original data should not be altered unless absolutely necessary.
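
As a minimal sketch of the snapshot idea, the example below copies a small in-memory table before analysis. The names `take_snapshot` and `live_table` are hypothetical, and a real database would use its own export or snapshot mechanism instead of a Python list.

```python
import copy


def take_snapshot(live_rows):
    """Return an independent deep copy of the rows for analysis."""
    return copy.deepcopy(live_rows)


# Hypothetical in-memory stand-in for a table that keeps changing.
live_table = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}]

snapshot = take_snapshot(live_table)

# The live table keeps receiving updates after the snapshot is taken.
live_table.append({"id": 3, "amount": 40.0})
live_table[0]["amount"] = 99.0

# Cleaning steps run on the snapshot and never touch the live data.
snapshot[1]["amount"] = 0.0
```

The snapshot still holds the two rows it was taken with, and the edit made to it does not leak back into `live_table`.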

Dataset Representation:

Selecting a subset that represents the entire dataset well is crucial for training machine learning models effectively.
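
One common way to keep a subset representative is stratified sampling, which preserves the share of each class. The sketch below is an illustration under that assumption; the function name `stratified_sample` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, fraction, seed=0):
    """Sample the same fraction from every label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sample = []
    for group in by_label.values():
        # Keep at least one row per label so no class disappears.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Toy data: 10 "rare" rows and 90 "common" rows.
rows = [{"x": i, "label": "rare" if i % 10 == 0 else "common"} for i in range(100)]
subset = stratified_sample(rows, "label", fraction=0.2)
```

Here the subset keeps the original 1:9 ratio between the two labels, which plain random sampling would only match on average.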

Identifying Fit Issues:

By examining the error rates on datasets not used in training (validation datasets), you can determine if a model is underfitting or overfitting.
As training proceeds, the validation error starts to increase from a certain point even though the training error keeps falling. This turning point serves as a criterion: high error on both training and validation data indicates underfitting, while low training error combined with rising validation error indicates overfitting.
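
The check described above can be sketched as a small function over per-round error curves. The name `diagnose_fit` and the `high_error` threshold of 0.3 are assumptions made for illustration; a sensible threshold depends on the task and the metric.

```python
def diagnose_fit(train_errors, val_errors, high_error=0.3):
    """Label a training run from its per-round error curves.

    Returns a (label, best_round) pair, where best_round is the round
    with the lowest validation error.
    """
    best_round = min(range(len(val_errors)), key=lambda i: val_errors[i])
    final_train, final_val = train_errors[-1], val_errors[-1]

    # Both errors stay high: the model has not captured the pattern.
    if final_train > high_error and final_val > high_error:
        return "underfitting", best_round

    # Validation error rose after its minimum while training error kept falling.
    if final_val > val_errors[best_round] and final_train <= train_errors[best_round]:
        return "overfitting", best_round

    return "reasonable fit", best_round


overfit_run = diagnose_fit(
    [0.50, 0.35, 0.25, 0.18, 0.12, 0.08],
    [0.52, 0.40, 0.33, 0.31, 0.34, 0.38],
)
underfit_run = diagnose_fit([0.50, 0.48, 0.47], [0.52, 0.51, 0.50])
good_run = diagnose_fit([0.50, 0.30, 0.20], [0.52, 0.33, 0.24])
```

In the first run the validation error bottoms out at round 3 and then climbs, which is the turning point the text refers to.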

Techniques to Control Overfitting

Regularization:

Introducing regularization techniques helps reduce model sensitivity to noise. These include:
- Early Stopping: Halting training when validation error begins to increase, even though training error is still improving (also applicable to structured data).
  Training error continues to decrease, but from a certain point the validation error begins to increase. Early stopping uses this observation to pick the point just before overfitting sets in. The choice is a trade-off: stopping too early leaves the model underfit, while stopping too late lets it fit noise.
- Parameter Norm Penalties (L1/L2 Regularization): Penalties applied to model parameters help maintain simplicity in the model and prevent overfitting. This approach is useful in structured data analysis and results in models that are less sensitive to noise.
  Figure: the blue curve shows the ideal fit with the penalty applied; the green curve shows severe overfitting when no penalty is applied.
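
Both techniques in the list above can be sketched in a few lines. The code below is an illustration, not a library implementation: `early_stopping_round` applies a patience rule to a list of validation errors, and `ridge_slope` shows how an L2 penalty shrinks the slope of a one-feature linear model without an intercept. The function names, the `patience` value, and the toy numbers are all assumptions.

```python
def early_stopping_round(val_errors, patience=2):
    """Return the round with the best validation error, stopping once
    `patience` rounds have passed without an improvement."""
    best_round, best_error = 0, float("inf")
    for round_idx, error in enumerate(val_errors):
        if error < best_error:
            best_round, best_error = round_idx, error
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round


def ridge_slope(xs, ys, l2_penalty):
    """Slope w minimizing sum((y - w*x)^2) + l2_penalty * w^2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + l2_penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2_penalty)


val_errors = [0.52, 0.40, 0.33, 0.31, 0.34, 0.38, 0.30]
stop_at = early_stopping_round(val_errors, patience=2)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
unpenalized = ridge_slope(xs, ys, l2_penalty=0.0)
penalized = ridge_slope(xs, ys, l2_penalty=14.0)
```

With a patience of 2 the loop stops before it reaches the last value in `val_errors`, which shows the trade-off: a later improvement can be missed. The penalized slope is smaller than the unpenalized one, which is what makes the model less sensitive to noise.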

Data Augmentation:

Data augmentation is widely used in deep learning, particularly with image data. It artificially creates additional training data through transformations such as rotations and flips to improve model robustness.
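
As a minimal sketch of the flips and rotations mentioned above, the example below treats an image as a nested list of pixel values. Real pipelines would normally rely on an image library; the function names here are hypothetical.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def augment(image):
    """Return the original image plus three transformed copies."""
    return [
        image,
        flip_horizontal(image),
        flip_vertical(image),
        rotate_90_clockwise(image),
    ]


image = [[1, 2],
         [3, 4]]
augmented = augment(image)
```

Each transformed copy keeps the same label as the original, so one labeled image yields four training examples.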

Synthetic Minority Over-sampling Technique (SMOTE):

SMOTE is used in structured data analysis to address class imbalance by generating synthetic examples near existing minority-class data points.
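
The core idea can be sketched as interpolation between a minority point and one of its nearest minority neighbours. This is a simplified illustration, not the implementation found in any particular library; `smote_like_samples`, `k`, and the toy points are assumptions, and the sketch needs at least two minority points.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority points to the base point, excluding the base itself.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_like_samples(minority, n_new=3)
```

Because every synthetic point lies on a segment between two real minority points, the new examples stay close to the existing minority region.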

Dropout:

Dropout prevents co-adaptation of features by randomly omitting a subset of units during training, which reduces the effective complexity of the model.
Nodes are randomly disconnected so that each training step uses only part of the features.
Dropout is used in deep learning; for structured data, pruning techniques play a similar role.
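
A minimal sketch of dropout on a single layer's activations follows. It uses the common "inverted" variant, which rescales the surviving values during training so that nothing needs to change at inference time; that choice, along with the function and variable names, is an assumption for illustration.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training:
        return list(activations)  # inference uses every unit
    keep_prob = 1.0 - drop_prob
    # Survivors are scaled up so the expected activation stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, drop_prob=0.5, rng=random.Random(0))
eval_out = dropout(activations, drop_prob=0.5, rng=random.Random(0), training=False)
```

During training each value is either dropped to zero or doubled (for a drop probability of 0.5), while evaluation returns the activations unchanged.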

Column sample by tree: each tree (one model in the ensemble) is built from a randomly sampled subset of the columns.
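
The sampling step can be sketched as follows. The ratio argument borrows the name of XGBoost's `colsample_bytree` parameter, but the code is only an illustration of the idea and not how the library implements it; the column names are made up.

```python
import random


def sample_columns(columns, colsample_bytree, rng):
    """Pick a random subset of columns for one tree, keeping column order."""
    k = max(1, int(len(columns) * colsample_bytree))
    chosen = set(rng.sample(columns, k))
    return [c for c in columns if c in chosen]


columns = ["age", "income", "tenure", "region", "plan", "usage"]
rng = random.Random(0)

# Each tree in the ensemble sees its own random half of the columns.
columns_per_tree = [sample_columns(columns, 0.5, rng) for _ in range(3)]
```

Since no single tree sees every column, the ensemble is less likely to lean on one noisy feature.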

Other techniques: batch normalization, noise robustness, and label smoothing.

The techniques shown in color in the original figure can also be used with structured data.

Ensuring Model Robustness and Generalization

Expanding Data and Feature Sets:

Incorporating more informative features can help reduce underfitting, while increasing the quantity and diversity of data helps the model generalize to unseen data.

Using More Complex Models:

Employing models with higher capacity can sometimes capture more complex patterns in the data, reducing underfitting but requiring careful handling to avoid overfitting.
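
To illustrate the effect of capacity, the sketch below fits two models to the same toy data: a constant (lowest capacity) and a least-squares line. The data and function names are hypothetical, and only training error is measured, so the example shows the underfitting side of the trade-off and says nothing about overfitting.

```python
def fit_constant(xs, ys):
    """Lowest-capacity model: always predict the mean of y."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

constant_error = mse(fit_constant(xs, ys), xs, ys)  # 8.0
line_error = mse(fit_line(xs, ys), xs, ys)          # 0.0
```

The constant model cannot follow the trend and underfits, while the line has just enough capacity to capture it.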

Validation and Reproducibility Strategies

Validation Strategies:

Techniques such as hold-out validation and cross-validation check that the model performs well across different subsets of the data, which reduces the risk that a single unrepresentative split skews the evaluation.
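
As a minimal sketch of how cross-validation partitions the data, the function below builds k train/validation index splits. The name `k_fold_indices` is hypothetical, and the rows are assumed to be shuffled already; in practice a library routine would usually handle both steps.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs.

    Every sample lands in exactly one validation fold.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    folds = []
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, val_idx))
        start += size
    return folds


folds = k_fold_indices(n_samples=10, k=3)
```

The model is trained and scored once per fold, and the k scores are averaged, so no single split decides the evaluation.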

Reproducibility:

Ensuring reproducibility through practices like setting random seeds helps in maintaining consistency across experiments and model retraining sessions.
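
A small sketch of seeding follows, using a hold-out split as the example. The function name `shuffled_split` and the seed value are arbitrary; the point is only that the same seed yields the same split on every run.

```python
import random


def shuffled_split(items, train_fraction, seed):
    """Shuffle with a seeded generator, then cut into train and validation parts."""
    rng = random.Random(seed)  # local generator, no global state touched
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


first_train, first_val = shuffled_split(range(10), train_fraction=0.8, seed=42)
second_train, second_val = shuffled_split(range(10), train_fraction=0.8, seed=42)
```

Both calls return identical splits, so an experiment rerun later is evaluated on exactly the same validation rows.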

By implementing these strategies, one can effectively manage the trade-offs between bias and variance, enhancing the predictive performance and general applicability of machine learning models.

Reference:

AI Engineer Fundamentals: Boostcams AI Tech Preparation Course [Part 2]

Introduction to Machine Learning Fundamentals (1)