Ensemble

- Concept: reduce error by constructing multiple machine learning models
  - generalization error = Bias^2 + Variance + Irreducible Error
  - reduce generalization error by
    - 1.Reduce variance
    - 2.Reduce bias
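For reference, the decomposition above can be written out for squared-error loss. This is a sketch under assumptions that are not in the original notes: the target is generated as y = f(x) + ε with zero-mean noise of variance σ², and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible Error}}
```

Only the first two terms depend on the model, which is why an ensemble can target either the variance term or the bias term.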
- General approach to constructing an ensemble (2 steps)
  - Step 1. Diversity Generation
    - Concept: construct multiple models (base models) from the training data
      - Combining identical models gains nothing, so the models must differ from one another
      - The ensemble performs well when there is significant diversity among the models
    - High diversity & high accuracy of the base models = high-performance ensemble
  - Step 2. Combination
    - Concept: combine the multiple models to predict the target
- Ensemble Technique
  - 1. Bagging (short for bootstrap aggregating)
    - 1. Bootstrap → generates diversity
      - choose items uniformly at random, with replacement
      - each bootstrap sample is used as a different training dataset
    - 2. Aggregating (← produces the output)
      - Majority voting ← for classification problems
      - Average of predicted values ← for regression
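The two bagging steps above can be sketched with the standard library alone. The function names (`bootstrap_sample`, `majority_vote`, `average`) are hypothetical helpers chosen for illustration, not part of any library.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(labels):
    """Aggregate class predictions by picking the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Aggregate regression predictions by taking their mean."""
    return mean(values)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = list(range(10))

# Step 1: each base model would be trained on its own bootstrap sample.
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# Step 2: combine the base models' predictions (made-up predictions here).
class_prediction = majority_vote(["cat", "dog", "cat"])
value_prediction = average([1.0, 2.0, 3.0])
```

Because sampling is with replacement, a bootstrap sample usually repeats some items and omits others, which is what makes the training sets differ.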
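    - Why do ensembles work?
      - classification
        
        As long as the base classifier error (ε) is below 0.5 and the base classifiers make independent errors, the majority-vote ensemble error is smaller than ε.
      - regression
        
        It can be shown mathematically that the expected error of the ensemble ≤ the average error made by the M individual models.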
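Both claims can be checked numerically. The sketch below assumes the base classifiers err independently with the same error rate, which the binomial formula requires. The function names are hypothetical.

```python
from math import comb
from statistics import mean


def majority_vote_error(eps, m):
    """Probability that a majority of m independent classifiers is wrong.

    Assumes every classifier has the same error rate eps and that errors
    are independent; m must be odd so that no ties occur.
    """
    if m % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    k_min = m // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(m, k) * eps**k * (1 - eps) ** (m - k) for k in range(k_min, m + 1)
    )


def squared_errors(predictions, target):
    """Return (error of the averaged prediction, average individual error)."""
    ensemble = (mean(predictions) - target) ** 2
    individual = mean((p - target) ** 2 for p in predictions)
    return ensemble, individual


better = majority_vote_error(0.35, 25)   # base error below 0.5
worse = majority_vote_error(0.60, 25)    # base error above 0.5
ens_err, avg_err = squared_errors([1.0, 3.0, 5.0], target=2.0)
```

With a base error below 0.5 the ensemble error drops below the base error. Above 0.5 the vote makes things worse, which is why the threshold matters.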
    - Random Forest (for classification & regression)
      - bagging specialized for the decision tree (DT) algorithm
      - injects additional randomness into tree building → adds diversity and ensures each tree is different
        
        Build many trees → reduce overfitting by averaging their results. (If you build many trees that overfit in different directions, averaging their results reduces the amount of overfitting, so overfitting can be reduced while keeping the predictive performance of the tree model.)
        
        DT weakness: it tends to overfit the training data → RF addresses this problem
      - How the trees are randomized
        
        1. Bagging
        
        The data points used to build each tree are selected at random.
        
        bootstrap: each DT in the RF is built on a different dataset
        
        2. Randomized tree = base model
        
        Instead of searching all features for the best test at each node, the algorithm randomly selects a subset of candidate features at each node and finds the best test among those candidates.
        
        In other words, the features considered in each split test are chosen at random.
        
        max_features: # of features to consider in each split
        
        high max_features:
        
        the trees in the RF are quite similar & fit the data easily
        
        max_features = n_features
        
        Randomized tree = normal DT
        
        Every split considers all features, so no randomness enters feature selection (the randomness from bootstrap sampling remains, though).
        
        low max_features: the trees in the RF are quite different (diverse) & each tree tends to be deep
        
        max_features = √p (rounded up) ← classification problem, where p is the total number of features
        
        max_features = p/3 (rounded down) ← regression problem
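A minimal sketch of the per-split feature sampling, using the rounding rules from the notes above. The helper names are hypothetical, and the floor of 1 for very small p is an added assumption so that at least one feature is always considered.

```python
import math
import random


def default_max_features(p, task):
    """Heuristic number of candidate features per split, as given in the notes."""
    if task == "classification":
        return math.ceil(math.sqrt(p))  # sqrt(p), rounded up
    if task == "regression":
        return max(1, p // 3)           # p/3, rounded down (assumed floor of 1)
    raise ValueError(f"unknown task: {task!r}")


def candidate_features(p, max_features, rng):
    """Pick the feature indices one split is allowed to consider."""
    return rng.sample(range(p), max_features)


rng = random.Random(0)
p = 10
k = default_max_features(p, "classification")
split_candidates = candidate_features(p, k, rng)

# With max_features == p every feature is a candidate, so the randomized
# tree behaves like a normal decision tree at this split.
all_candidates = candidate_features(p, p, rng)
```

A fresh subset is drawn at every node, so two trees trained on the same bootstrap sample can still end up with different splits.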
      - Hyperparameters
        
        n_estimators: # of randomized trees; more is generally better, with diminishing returns and higher training cost
        
        bootstrap: whether bootstrap samples are used (usually True, for diversity)
        
        max_features: # of features to consider in each split
        
        plus the main hyperparameters of the DT itself
      - Why it is useful
        
        not sensitive to hyperparameters
        
        does not overfit easily
        
        fast (training can be time-consuming on large data, but the trees can be trained independently on multiple CPU cores)
  - 2. Boosting
    - Concept
      - an iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified objects (in short, it keeps a weight for each instance)
    - Drawback
      - sequentially reduces error on the training data → easy to overfit
    - Characteristics
      - many kinds of classification algorithms can serve as the base model (DT, SVM, …)
      - the models are ordered → each model is affected by the previous one
    - 1. Training phase
      - All n objects in D are assigned equal weight
      - A series of k classifiers is learned iteratively
        
        weighted sampling: draw a sample based on the object weights
        
        build a classifier on the sample
        
        update the object weights
        
        wrongly/correctly classified → weight increases/decreases
    - 2. Test phase
      - each classifier returns its class prediction for x_new
      - assign x_new the class with the most votes (weighted voting)
        
        the weight of each classifier’s vote is a function of its accuracy
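The notes do not name a specific boosting algorithm, so the sketch below uses the AdaBoost update rule as one concrete, assumed choice for "update object weights" and "a function of its accuracy". All function names are hypothetical.

```python
import math
import random


def weighted_sample(items, weights, rng):
    """Draw len(items) items with replacement, proportionally to their weights."""
    return rng.choices(items, weights=weights, k=len(items))


def update_weights(weights, is_correct):
    """One AdaBoost-style round: raise weights of misclassified objects.

    Returns the normalized new weights and the classifier's vote weight
    (alpha). Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, ok in zip(weights, is_correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - err) / err)  # more accurate -> larger vote
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(raw)
    return [w / total for w in raw], alpha


def weighted_vote(predictions, alphas):
    """Test phase: pick the class with the largest total vote weight."""
    scores = {}
    for label, alpha in zip(predictions, alphas):
        scores[label] = scores.get(label, 0.0) + alpha
    return max(scores, key=scores.get)


# Training phase, one round: four objects start with equal weight and the
# current classifier gets only the last one wrong.
weights = [0.25, 0.25, 0.25, 0.25]
new_weights, alpha = update_weights(weights, [True, True, True, False])

# The next classifier is trained on a sample that favors the hard object.
next_sample = weighted_sample(["a", "b", "c", "d"], new_weights, random.Random(0))

# Test phase: one strong classifier can outvote two weak ones.
winner = weighted_vote(["spam", "ham", "ham"], [2.0, 0.5, 0.5])
```

Under this rule the misclassified objects end up carrying half of the total weight after each round, which forces the next classifier to focus on them.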
  - Bagging VS Boosting
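  - Implementation
    - Implementing bagging directly on decision trees
      - the max_samples parameter sets the size of each bootstrap sample (random forest did not expose this in older scikit-learn versions; since 0.22 it also accepts max_samples)
    - Using a random forest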
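Both options might look like the following with scikit-learn, which is assumed here because the parameter names in the notes (`max_features`, `bootstrap`, `max_samples`) match its API. The synthetic dataset and all hyperparameter values are illustrative choices, not taken from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to keep the example self-contained.
X, y = make_classification(n_samples=500, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: bagging applied directly to decision trees.
bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),  # base model
    n_estimators=50,    # number of base models
    max_samples=0.8,    # each bootstrap sample holds 80% of the training set
    bootstrap=True,     # sample with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)

# Option 2: random forest, which adds per-split feature sampling.
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # about sqrt(p) candidate features per split
    bootstrap=True,
    n_jobs=-1,            # train the trees in parallel on all CPU cores
    random_state=0,
)
forest.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
```

The two accuracies depend on the data and seed, so neither model should be expected to win in general.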

DDORI
