Section 2: Hands-on Practice with Prediction Algorithms in Scikit-learn

Chapter 4: Model Selection (2)

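This is a study post based on [WikiBooks - 파이썬 머신러닝 완벽 가이드].

In the previous post, we looked at the concept and implementation of K-fold, one form of cross-validation.

In this post, we look at scikit-learn utilities that make K-fold cross-validation more convenient.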

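Stratified K-fold Cross-Validation

- What is Stratified K-fold?

Plain K-fold splits the dataset without looking at the labels (in consecutive order by default, or randomly with shuffle=True), so the label distribution (ratio) within a fold can end up very different from the distribution in the full dataset.

Stratified K-fold cross-validation first looks at the label distribution in the original data and then assigns samples to the training and validation sets so that each set keeps approximately that same distribution.

(For example, if the original data consists of A and B in the ratio A : B = 3 : 7, the training set and the test set are each built so that their composition is also A : B = 3 : 7.)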
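The 3 : 7 example can be checked with a small sketch. The toy labels below (30 samples of "A", 70 of "B") and the placeholder feature array are assumptions made up for illustration; they are not from the original post.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 30 samples of class "A" and 70 of class "B" (A : B = 3 : 7).
labels = np.array(["A"] * 30 + ["B"] * 70)
features = np.zeros((len(labels), 1))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=5)

train_a_ratios, test_a_ratios = [], []
for train_idx, test_idx in skf.split(features, labels):
    # Share of class "A" in the training part and the test part of this fold
    train_a_ratios.append(float(np.mean(labels[train_idx] == "A")))
    test_a_ratios.append(float(np.mean(labels[test_idx] == "A")))

print("share of A in each training split:", train_a_ratios)
print("share of A in each test split:    ", test_a_ratios)
```

Because both class counts divide evenly by the number of folds here, every split keeps the 3 : 7 ratio exactly; with counts that do not divide evenly, the ratio is only approximately preserved.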

- How to run Stratified K-fold cross-validation

Stratified K-fold cross-validation is easy to implement with scikit-learn's StratifiedKFold class.

Create an instance as shown below and set n_splits, the number of folds k.

In [1]:

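import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3)

# Load iris into a DataFrame and attach the target as a 'label' column
iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['label'] = iris.target
iris_df['label'].value_counts()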

Out[1]:

2    50
1    50
0    50
Name: label, dtype: int64

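The target of the original iris data is composed of three classes in a 50 : 50 : 50 ratio.

When the data is split with Stratified K-fold, the resulting train/test sets should keep the same composition.

The code below prints the label composition of the three train/test sets produced by Stratified K-fold.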

In [2]:

n_iter = 0

# split() needs the labels as well, so that it can stratify on them
for train_index, test_index in skf.split(iris_df, iris_df['label']):
    n_iter += 1
    label_train = iris_df['label'].iloc[train_index]
    label_test = iris_df['label'].iloc[test_index]
    print('## Cross-validation: {0}'.format(n_iter))
    print('Training label distribution:\n', label_train.value_counts())
    print('Validation label distribution:\n', label_test.value_counts())

## Cross-validation: 1
Training label distribution:
 2    33
1    33
0    33
Name: label, dtype: int64
Validation label distribution:
 2    17
1    17
0    17
Name: label, dtype: int64
## Cross-validation: 2
Training label distribution:
 2    33
1    33
0    33
Name: label, dtype: int64
Validation label distribution:
 2    17
1    17
0    17
Name: label, dtype: int64
## Cross-validation: 3
Training label distribution:
 2    34
1    34
0    34
Name: label, dtype: int64
Validation label distribution:
 2    16
1    16
0    16
Name: label, dtype: int64

The results above show that every train set and every test set has the same 1 : 1 : 1 class composition.

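A simpler way to cross-validate: cross_val_score( )

- What cross_val_score( ) does

The cross_val_score( ) function exists to make cross-validation easy.

It is called as cross_val_score(estimator, feature data set, label data set, scoring criterion, number of folds (cv)), so the whole cross-validation run is controlled from a single call.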
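As a rough sketch of that argument form, the call below spells out the keyword names (estimator, X, y, scoring, cv). It also shows the related cross_validate function, which takes the same arguments but can evaluate several metrics at once. The random_state value and the choice of f1_macro as a second metric are arbitrary assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)  # random_state chosen arbitrarily

# One metric: returns an array with one score per fold.
acc_per_fold = cross_val_score(estimator=clf, X=X, y=y, scoring="accuracy", cv=3)

# Several metrics: cross_validate returns a dict of arrays, including timings.
results = cross_validate(clf, X, y, scoring=["accuracy", "f1_macro"], cv=3)

print(acc_per_fold)
print(sorted(results.keys()))
```

cross_val_score is the shorter option when a single score per fold is enough; cross_validate is useful when fit times or multiple metrics are needed.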

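- How to cross-validate with cross_val_score( )

To use cross_val_score( ), import it from sklearn.model_selection. The cell below also imports cross_validate, a related function that can additionally return multiple metrics and fit times, although only cross_val_score is used here.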

In [3]:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.datasets import load_iris

In [4]:

iris = load_iris()
data = iris.data
label = iris.target
dt_clf = DecisionTreeClassifier(random_state=156)

In [5]:

# Accuracy on each of the 3 folds
score = cross_val_score(dt_clf, data, label, scoring='accuracy', cv=3)

print('Accuracy per fold:', np.round(score, 4))
print('Mean validation accuracy:', np.round(np.mean(score), 4))

Accuracy per fold: [0.9804 0.9216 0.9792]
Mean validation accuracy: 0.9604

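In this run, because the estimator is a classifier and cv is given as an integer, cross_val_score automatically uses Stratified K-fold as its K-fold strategy when splitting and scoring.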
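One way to check this behaviour is to pass the splitter explicitly and compare. The sketch below assumes the same estimator settings as the cell above; the variable names are made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=156)

# cv as an integer: the splitter is chosen by scikit-learn
scores_int = cross_val_score(clf, X, y, scoring="accuracy", cv=3)

# cv as an explicit stratified splitter
scores_stratified = cross_val_score(clf, X, y, scoring="accuracy",
                                    cv=StratifiedKFold(n_splits=3))

# cv as an explicit plain (unshuffled) K-fold splitter
scores_plain = cross_val_score(clf, X, y, scoring="accuracy",
                               cv=KFold(n_splits=3))

print("cv=3:           ", scores_int)
print("StratifiedKFold:", scores_stratified)
print("KFold:          ", scores_plain)
```

The first two results match, while unshuffled KFold scores 0 on every fold because iris is sorted by class and each test fold contains a class the model never saw during training.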

GridSearchCV

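- What is GridSearchCV?

GridSearchCV is a scikit-learn feature that applies cross-validation and hyperparameter tuning in a single step.
It feeds the candidate hyperparameter values of an algorithm such as a Classifier or Regressor into the model one combination at a time, which makes it convenient to find the best parameters.

# Hyperparameter: a setting of a machine learning algorithm that is chosen before training rather than learned from the data; algorithm performance can be improved by tuning these values.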
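To make the effect of a hyperparameter concrete, here is a minimal sketch that changes only max_depth of a decision tree and compares the mean cross-validation accuracy on iris. The two depth values and random_state=0 are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

mean_accuracy = {}
for depth in (1, 3):
    # Same algorithm and data; only the max_depth hyperparameter differs
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    mean_accuracy[depth] = cross_val_score(clf, X, y, cv=3).mean()
    print(f"max_depth={depth}: mean accuracy {mean_accuracy[depth]:.4f}")
```

A depth-1 tree can separate at most two of the three classes, so it scores clearly lower than the depth-3 tree.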

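- How to cross-validate with GridSearchCV

To use GridSearchCV, first decide which hyperparameter values to compare. Below, the grid is restricted so that performance is compared for decision-tree depths (max_depth) of 1 to 3 and for min_samples_split, the minimum number of samples a node needs before it can be split, of 2 or 3.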

In [6]:

grid_parameters = {'max_depth' : [1, 2, 3],
                   'min_samples_split' : [2, 3]}
# 3 x 2 = 6 parameter combinations are evaluated, each on every CV fold, so expect longer run times
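The six combinations can be listed with scikit-learn's ParameterGrid helper. This is only a sketch for inspecting the grid; the fold count n_folds = 3 is an assumption matching the cv=3 setting used later in this post.

```python
from sklearn.model_selection import ParameterGrid

grid_parameters = {'max_depth': [1, 2, 3],
                   'min_samples_split': [2, 3]}

# Expand the grid into the list of candidate parameter dicts
combinations = list(ParameterGrid(grid_parameters))
for params in combinations:
    print(params)

n_folds = 3  # assumed fold count (cv=3)
n_cv_fits = len(combinations) * n_folds
print("candidates:", len(combinations))
print("cross-validation fits:", n_cv_fits)
```

The cost grows multiplicatively with each added parameter or value, which is why the grid is kept small here.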

To run GridSearchCV cross-validation, import the GridSearchCV class from sklearn.model_selection.

In [7]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

In [8]:

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.2, random_state=121)

dtree = DecisionTreeClassifier()
parameters = {'max_depth' : [1, 2, 3], 'min_samples_split' : [2, 3]}

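In the earlier examples the decision tree estimator itself was assigned to a variable and used directly. Here the difference is that a GridSearchCV object wrapping the estimator is assigned to a variable, and fit and predict are applied through that object.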

In [9]:

import pandas as pd

# Evaluate each hyperparameter combination in param_grid on 3 train/validation folds
# refit=True is the default: after the search, the estimator is retrained once on the full training data with the best parameters

grid_dtree = GridSearchCV(dtree, param_grid=parameters, cv=3, refit=True)
grid_dtree.fit(x_train, y_train)

Out[9]:

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [10]:

# Extract the GridSearchCV results and convert them to a DataFrame

scores_df = pd.DataFrame(grid_dtree.cv_results_)
scores_df[['params',
           'mean_test_score',
           'rank_test_score',
           'split0_test_score',
           'split1_test_score',
           'split2_test_score']]

Out[10]:

	params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
0	{'max_depth': 1, 'min_samples_split': 2}	0.700000	5	0.700	0.7	0.70
1	{'max_depth': 1, 'min_samples_split': 3}	0.700000	5	0.700	0.7	0.70
2	{'max_depth': 2, 'min_samples_split': 2}	0.958333	3	0.925	1.0	0.95
3	{'max_depth': 2, 'min_samples_split': 3}	0.958333	3	0.925	1.0	0.95
4	{'max_depth': 3, 'min_samples_split': 2}	0.975000	1	0.975	1.0	0.95
5	{'max_depth': 3, 'min_samples_split': 3}	0.975000	1	0.975	1.0	0.95

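Both max_depth=3 settings tie for rank 1 with a mean score of 0.975, so either gives the best result; we take {'max_depth': 3, 'min_samples_split': 2}, the first of the two.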

In [11]:

print('GridSearchCV best parameters : ', grid_dtree.best_params_)
print('GridSearchCV best accuracy : {0:.4f}'.format(grid_dtree.best_score_))

GridSearchCV best parameters :  {'max_depth': 3, 'min_samples_split': 2}
GridSearchCV best accuracy : 0.9750
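What GridSearchCV automates can be approximated by hand: loop over every parameter combination, cross-validate each one, and keep the best mean score. The sketch below is an illustration of that idea, not scikit-learn's actual implementation, and it fixes random_state=0 on the tree (an assumption not in the original cells) so that both approaches are deterministic and comparable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, ParameterGrid,
                                     cross_val_score, train_test_split)
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=121)

parameters = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}

# Manual search: cross-validate every combination and remember the best one
best_score, best_params = -1.0, None
for params in ParameterGrid(parameters):
    clf = DecisionTreeClassifier(random_state=0, **params)
    mean_score = cross_val_score(clf, x_train, y_train, cv=3).mean()
    if mean_score > best_score:
        best_score, best_params = mean_score, params

# The same search done by GridSearchCV
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid=parameters, cv=3)
grid.fit(x_train, y_train)

print("manual search:", best_params, round(best_score, 4))
print("GridSearchCV :", grid.best_params_, round(grid.best_score_, 4))
```

On top of this loop, GridSearchCV also records per-fold results in cv_results_ and, with refit=True, retrains the best candidate on all of the training data.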

GridSearchCV also makes it easy to obtain the prediction accuracy on the test data.

In [12]:

# Get the estimator that GridSearchCV already retrained via refit
estimator = grid_dtree.best_estimator_

In [13]:

# Predict with the estimator above
pred = estimator.predict(x_test)
print('Test data set accuracy : {0:.4f}'.format(accuracy_score(y_test, pred)))

Test data set accuracy : 0.9667
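With refit=True, the fitted GridSearchCV object can also be used for prediction directly, since its predict and score methods delegate to best_estimator_. The sketch below rebuilds the setup in a self-contained form; random_state=0 on the tree and the variable names are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=121)

parameters = {'max_depth': [1, 2, 3], 'min_samples_split': [2, 3]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid=parameters, cv=3, refit=True)
grid.fit(x_train, y_train)

# Predicting through the search object or through best_estimator_ gives the same labels
pred_via_grid = grid.predict(x_test)
pred_via_best = grid.best_estimator_.predict(x_test)
same_predictions = np.array_equal(pred_via_grid, pred_via_best)

# score() on the search object evaluates the refit estimator on the given data
grid_score = grid.score(x_test, y_test)
manual_score = accuracy_score(y_test, pred_via_best)

print("identical predictions:", same_predictions)
print("test accuracy: {0:.4f}".format(grid_score))
```

Extracting best_estimator_ is still handy when the tuned model needs to be passed around or saved without the search object.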

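In this post, we looked at several ways to run K-fold cross-validation: StratifiedKFold, cross_val_score( ), and GridSearchCV.

In the next post, we will look at how to preprocess the data used in prediction and classification models.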

Dlearner's self-development blog

[Statistics with Python] 03. Stratified K-fold, cross_val_score, GridSearchCV (with Scikit-learn)
