
Introduction to Machine Learning Fundamentals (1)

Understanding Underfitting and Overfitting in Machine Learning

Definitions:

  • Fit: The ability of a model to accurately describe or predict outcomes based on the given data.
  • Underfitting: Occurs when a model fails to capture the underlying pattern of the data, typically leading to poor performance on both training and new data.
  • Overfitting: Happens when a model captures noise instead of the underlying data pattern, performing well on training data but poorly on new, unseen data.

Example Using XGBoost Model:

  • Models can exhibit underfitting, an ideal fit, or overfitting, depending on how well they generalize from the training data to unseen data.
  • Underfitting (figure omitted)

  • Ideal fit (figure omitted)

  • Overfitting (figure omitted)
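Since the XGBoost figures are not reproduced here, the same three regimes can be sketched with plain polynomial regression as a stand-in (the polynomial degrees are illustrative, not from the original post):

```python
# Under/over-fitting sketch with polynomial regression (a numpy stand-in
# for the post's XGBoost figures; degrees 1/5/15 are illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)  # noisy sine wave

x_train, y_train = x[::2], y[::2]   # every other point for training
x_val, y_val = x[1::2], y[1::2]     # the rest held out for validation

def val_error(degree):
    coefs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    pred = np.polyval(coefs, x_val)
    return float(np.mean((pred - y_val) ** 2))

errors = {d: val_error(d) for d in (1, 5, 15)}
# degree 1 underfits (a line cannot follow the sine),
# degree 5 is close to the ideal fit,
# degree 15 has enough capacity to chase the noise
```

Comparing the validation errors across degrees reproduces the three-regime picture: the underfit model is bad everywhere, while the high-degree model's extra capacity only helps on the training points.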

Datasets are a subset of the whole (figure omitted)

Our data is either a full population or a sample drawn from it.

Monitoring and Managing Fit in Models

Real-time Data Monitoring:

  • Datasets in databases are often updated in real-time; thus, it’s crucial to analyze these using a fixed snapshot without altering the original data unless absolutely necessary.

Dataset Representation: (figure omitted)

  • Selecting a subset that well-represents the entire dataset is crucial for effectively training machine learning models.
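One common way to draw such a representative subset is stratified sampling, which preserves class proportions. A minimal stdlib sketch (the labels and fraction below are illustrative):

```python
# Stratified-sampling sketch: draw a subset that keeps each class's share
# of the data, so the sample represents the full dataset.
import random

def stratified_sample(labels, fraction, seed=0):
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    sample = []
    for label, indices in by_class.items():
        k = max(1, int(len(indices) * fraction))  # per-class quota
        sample.extend(rng.sample(indices, k))
    return sorted(sample)

labels = ["a"] * 80 + ["b"] * 20          # imbalanced population
subset = stratified_sample(labels, fraction=0.1)
# subset keeps the 80/20 ratio: 8 "a" rows and 2 "b" rows
```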

Identifying Fit Issues:

  • By examining the error rates on datasets not used in training (validation datasets), you can determine whether a model is underfitting or overfitting. (figure omitted)

    Beyond a certain point, the error on the validation dataset begins to increase even as the training error keeps falling. This divergence serves as a practical criterion for judging whether a machine learning model is overfitting or underfitting.

Techniques to Control Overfitting

Regularization:

  • Introducing regularization techniques helps reduce model sensitivity to noise. These include:
    • Early Stopping: Halting training when validation error begins to increase, despite continued improvements in training error (also applicable to structured data). (figure omitted)

      Training error continues to decrease, but at a certain point the validation error begins to rise. This observation guides early stopping: choose the point just before overfitting sets in, trading a little training-set accuracy for better generalization.
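In practice early stopping is usually implemented with a "patience" window: training stops once validation error has failed to improve for several consecutive epochs, and the best epoch seen so far is kept. A minimal sketch with hypothetical error values:

```python
# Early-stopping sketch with a patience window (error values are made up).
def early_stop(val_errors, patience=2):
    """Return the index of the best epoch, halting the scan once the
    validation error has not improved for `patience` epochs in a row."""
    best_err, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.44, 0.47]
best = early_stop(val_errors)
# stops after two non-improving epochs and keeps epoch 3 (error 0.40)
```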

    • Parameter Norm Penalties (L1/L2 Regularization): Penalties applied to model parameters keep the model simple and prevent overfitting. This approach is useful in structured data analysis and yields models that are less sensitive to noise. (figure omitted)

      Blue chart: the ideal fit, with penalties applied. Green chart: severe overfitting caused by omitting the penalty.
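For linear models the L2 penalty has a closed form, which makes its shrinking effect easy to see. A sketch (the data and λ value are illustrative):

```python
# L2 (ridge) penalty sketch: closed form w = (X^T X + lam*I)^{-1} X^T y.
# Larger lam pulls the weights toward zero, reducing sensitivity to noise.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 0.1, 30)       # noisy linear data

def ridge(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_plain = ridge(X, y, lam=0.0)     # ordinary least squares (no penalty)
w_ridge = ridge(X, y, lam=10.0)    # penalized: coefficients shrink
```

The penalized weight vector always has a smaller norm than the unpenalized one, which is exactly the "less sensitive to noise" behavior described above.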

Data Augmentation: (figure omitted)

  • Widely used in deep learning, particularly with image data; involves artificially creating training data through transformations like rotations and flips to improve model robustness.
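The transformations themselves are simple array operations. A sketch on a tiny stand-in "image" (real pipelines would use an image library, but the idea is the same):

```python
# Data-augmentation sketch: generate flipped/rotated copies of an "image".
import numpy as np

image = np.arange(9).reshape(3, 3)   # tiny stand-in for a training image

augmented = [
    image,
    np.fliplr(image),        # horizontal flip
    np.flipud(image),        # vertical flip
    np.rot90(image),         # 90-degree rotation
    np.rot90(image, k=2),    # 180-degree rotation
]
# one labeled example becomes five, at no extra labeling cost
```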

Synthetic Minority Over-sampling Technique (SMOTE): (figure omitted)

  • Used in structured data analysis for addressing class imbalance by generating synthetic examples near existing minority class data points.
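The core of SMOTE is interpolation: a synthetic point is placed on the line segment between a minority sample and one of its neighbors. A simplified sketch (real SMOTE picks the neighbor via k-nearest-neighbors; here the neighbor choice is random for brevity):

```python
# Minimal SMOTE-style sketch: synthesize a point between two minority
# samples. Real SMOTE chooses the partner among k nearest neighbors.
import numpy as np

rng = np.random.default_rng(0)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]])  # toy minority class

def smote_sample(points, rng):
    i, j = rng.choice(len(points), size=2, replace=False)
    t = rng.uniform()                       # position along the segment
    return points[i] + t * (points[j] - points[i])

synthetic = smote_sample(minority, rng)
# the new point lies between two existing minority points
```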

Dropout: (figure omitted)

  • A technique to prevent co-adaptation of features by randomly omitting subsets of features (i.e., reducing the effective complexity of the model) during training.
  • Randomly disconnects nodes so that only part of the features are used on each pass.
  • Use dropout in deep learning; use pruning techniques for structured data.
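The mechanism itself is a random mask. A sketch of the common "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged:

```python
# Inverted-dropout sketch: zero each activation with probability p and
# scale survivors by 1/(1-p) so the expected activation is unchanged.
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5):
    mask = rng.random(activations.shape) >= p   # keep with probability 1 - p
    return activations * mask / (1.0 - p)

h = np.ones(10)                 # toy layer activations
dropped = dropout(h, p=0.5)
# roughly half the nodes are disconnected for this forward pass
```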

Column sample by tree: implemented by randomly sampling the columns used when building each individual tree. (figure omitted)
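The sampling step corresponds to XGBoost's `colsample_bytree` parameter; the mechanism can be sketched as follows (the column count and fraction are illustrative):

```python
# colsample_bytree sketch: each tree sees only a random subset of columns,
# so individual trees cannot all latch onto the same noisy feature.
import numpy as np

rng = np.random.default_rng(0)
n_columns, colsample_bytree = 10, 0.6

def sample_columns(n_columns, fraction, rng):
    k = max(1, int(n_columns * fraction))
    return rng.choice(n_columns, size=k, replace=False)

cols_tree_1 = sample_columns(n_columns, colsample_bytree, rng)
cols_tree_2 = sample_columns(n_columns, colsample_bytree, rng)
# each tree trains on 6 of the 10 columns; subsets vary from tree to tree
```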

Other techniques: batch normalization, noise robustness, label smoothing.

Some of these techniques can also be used with structured data.

Ensuring Model Robustness and Generalization

Expanding Data and Feature Sets:

  • Increasing the quantity and diversity of data and incorporating more features can help in reducing underfitting.

Using More Complex Models:

  • Employing models with higher capacity can sometimes capture more complex patterns in the data, reducing underfitting but requiring careful handling to avoid overfitting.

Validation and Reproducibility Strategies

Validation Strategies:

  • Employing techniques like hold-out validation and cross-validation ensures that the model performs well across different subsets of data and reduces the likelihood of skewness in model evaluation.
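The k-fold idea is simple to state in code: partition the indices into k folds and let each fold serve once as the validation set. A stdlib sketch:

```python
# k-fold cross-validation sketch: every sample is validated exactly once.
def kfold_indices(n_samples, k):
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]            # distribute the remainder
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i not in set(val)]
        folds.append((train, val))
        start += size
    return folds

folds = kfold_indices(10, k=5)
# five (train, validation) splits; each validation fold holds 2 samples
```

In practice the indices would be shuffled (or stratified) before folding; the sketch keeps them in order for clarity.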

Reproducibility:

  • Ensuring reproducibility through practices like setting random seeds helps in maintaining consistency across experiments and model retraining sessions.
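The seed-setting habit can be illustrated with the stdlib alone; in a real project every randomness source (framework, data shuffling, augmentation) would be seeded the same way:

```python
# Reproducibility sketch: a fixed seed makes "random" draws repeatable
# across runs and machines.
import random

def sample_rows(n_rows, n_samples, seed=42):
    rng = random.Random(seed)        # local, explicitly seeded generator
    return rng.sample(range(n_rows), n_samples)

first = sample_rows(1000, 5)
second = sample_rows(1000, 5)
# identical seed -> identical "random" subset on every call
```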

By implementing these strategies, one can effectively manage the trade-offs between bias and variance, enhancing the predictive performance and general applicability of machine learning models.

reference:

AI Engineer Fundamentals: Boostcamp AI Tech Preparation Course [Part 2]

This post is licensed under CC BY 4.0 by the author.