Building a Simple Neural Network with R (16)

Previous post: https://blog.naver.com/jjy0501/221490196910

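These AI posts keep getting more complicated, so this will probably be the last installment of the "simple neural network" series. From the next post on, I plan to work through logistic regression and continue under the heading of deep learning with R. In any case, the H2O package offers a wide range of features, as you would expect from a commercial-grade AI tool, and one of the more interesting ones is AutoML.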

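As we saw earlier, an artificial neural network has many options, and the results change depending on how those options are tuned. Getting the best result takes repeated trial and error, which consumes a great deal of time and effort. Techniques have therefore been developed to automate the process of testing several models and comparing them to find the best one. H2O's AutoML is exactly that kind of feature: it proposes models built not only with deep learning but also with machine learning techniques such as random forests, GBMs (Gradient Boosting Machines), and Stacked Ensembles. Let's look at the example provided by H2O.ai.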

#autoML

library(h2o)

localH2O = h2o.init()

# Import a sample binary outcome train/test set into H2O

train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")

test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response

y <- "response"

x <- setdiff(names(train), y)

# For binary classification, response should be a factor

train[,y] <- as.factor(train[,y])

test[,y] <- as.factor(test[,y])

# Run AutoML for 20 base models (limited to 1 hour max runtime by default)

aml <- h2o.automl(x = x, y = y,

training_frame = train,

max_models = 20,

seed = 1)

# View the AutoML Leaderboard

lb <- aml@leaderboard

print(lb, n = nrow(lb)) # Print all rows instead of default (6 rows)

# The leader model is stored here

aml@leader

# If you need to generate predictions on a test set, you can make

# predictions directly on the `"H2OAutoML"` object, or on the leader

# model object directly

pred <- h2o.predict(aml, test) # predict(aml, test) also works

pred
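
The comment in the example notes that an AutoML run is time-limited by default. As a rough sketch, the time budget can also be set explicitly with the max_runtime_secs argument of h2o.automl() instead of a fixed model count. The 300-second budget, the thread count of 8, and the name aml_quick are illustrative assumptions, not values from the original example.

```r
library(h2o)

# Start H2O with an explicit thread count (8 is an illustrative value;
# nthreads = -1 would use every available core)
h2o.init(nthreads = 8)

# Same Higgs sample data as in the example above
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")

# Identify predictors and response, and make the response a factor
y <- "response"
x <- setdiff(names(train), y)
train[, y] <- as.factor(train[, y])

# Cap the whole AutoML run at 5 minutes instead of a fixed number of models.
# The 300-second budget is an arbitrary value chosen for illustration.
aml_quick <- h2o.automl(x = x, y = y,
                        training_frame = train,
                        max_runtime_secs = 300,
                        seed = 1)

# Show whatever models were finished within the time budget
print(aml_quick@leaderboard)
```

A time-limited run will usually train fewer models than the 20-model run, so its leaderboard is likely to differ, and results may vary between runs even with a fixed seed because the number of models that finish depends on the hardware.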

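The 20-model example above takes a considerable amount of time to run. Even with 8 logical CPUs allocated it took more than 20 minutes, so plan accordingly. The leaderboard from that run is shown below.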

> print(lb, n = nrow(lb)) # Print all rows instead of default (6 rows)

model_id auc logloss mean_per_class_error rmse mse

1 StackedEnsemble_AllModels_AutoML_20190324_225820 0.7890978 0.5524149 0.3196844 0.4326267 0.1871658

2 StackedEnsemble_BestOfFamily_AutoML_20190324_225820 0.7854526 0.5561179 0.3255787 0.4343390 0.1886503

3 GBM_5_AutoML_20190324_225820 0.7808367 0.5599029 0.3408479 0.4361915 0.1902630

4 GBM_2_AutoML_20190324_225820 0.7800364 0.5598060 0.3399258 0.4364149 0.1904580

5 GBM_1_AutoML_20190324_225820 0.7798268 0.5608570 0.3350957 0.4366159 0.1906335

6 GBM_3_AutoML_20190324_225820 0.7786685 0.5617903 0.3255378 0.4371886 0.1911339

7 GBM_grid_1_AutoML_20190324_225820_model_4 0.7777929 0.6275684 0.3187812 0.4667522 0.2178576

8 GBM_grid_1_AutoML_20190324_225820_model_1 0.7772185 0.6008910 0.3227162 0.4535716 0.2057272

9 GBM_4_AutoML_20190324_225820 0.7714230 0.5697120 0.3374203 0.4410703 0.1945430

10 GBM_grid_1_AutoML_20190324_225820_model_2 0.7698263 0.6070157 0.3072613 0.4566813 0.2085578

11 DRF_1_AutoML_20190324_225820 0.7428924 0.5958832 0.3554027 0.4527742 0.2050045

12 XRT_1_AutoML_20190324_225820 0.7420910 0.5993457 0.3565826 0.4531168 0.2053148

13 DeepLearning_grid_1_AutoML_20190324_225820_model_5 0.7351121 0.6072728 0.3589475 0.4570660 0.2089094

14 GBM_grid_1_AutoML_20190324_225820_model_3 0.7279124 0.6889933 0.3598375 0.4763117 0.2268729

15 GBM_grid_1_AutoML_20190324_225820_model_5 0.7167078 0.6865927 0.3871737 0.4967212 0.2467320

16 DeepLearning_1_AutoML_20190324_225820 0.7041353 0.6277710 0.3983563 0.4669145 0.2180092

17 DeepLearning_grid_1_AutoML_20190324_225820_model_1 0.6959598 0.7166216 0.3989566 0.4849767 0.2352024

18 DeepLearning_grid_1_AutoML_20190324_225820_model_4 0.6945008 0.7056079 0.4041671 0.4878114 0.2379600

19 GLM_grid_1_AutoML_20190324_225820_model_1 0.6826524 0.6385205 0.3972341 0.4726827 0.2234290

20 DeepLearning_grid_1_AutoML_20190324_225820_model_3 0.6807204 0.6708166 0.4045776 0.4810438 0.2314031

21 DeepLearning_grid_1_AutoML_20190324_225820_model_2 0.6801893 0.7273539 0.4073531 0.4926833 0.2427368

In this leaderboard, the first-ranked model, StackedEnsemble_AllModels_AutoML_20190324_225820, achieved the best result. More detailed information about this model is shown below.

> # The leader model is stored here

> aml@leader

Model Details:

==============

H2OBinomialModel: stackedensemble

Model ID: StackedEnsemble_AllModels_AutoML_20190324_225820

NULL

H2OBinomialMetrics: stackedensemble

** Reported on training data. **

MSE: 0.1049423

RMSE: 0.323948

LogLoss: 0.3599495

Mean Per-Class Error: 0.1240032

AUC: 0.9527414

pr_auc: 0.9548414

Gini: 0.9054828

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:

0 1 Error Rate

0 3906 799 0.169819 =799/4705

1 414 4881 0.078187 =414/5295

Totals 4320 5680 0.121300 =1213/10000

Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx

1 max f1 0.488791 0.889476 208

2 max f2 0.369647 0.931987 253

3 max f0point5 0.616275 0.897052 160

4 max accuracy 0.517914 0.879400 197

5 max precision 0.948386 1.000000 0

6 max recall 0.171604 1.000000 346

7 max specificity 0.948386 1.000000 0

8 max absolute_mcc 0.514976 0.757984 198

9 max min_per_class_accuracy 0.551951 0.877620 185

10 max mean_per_class_accuracy 0.536136 0.878565 190

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

H2OBinomialMetrics: stackedensemble

** Reported on cross-validation data. **

** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

MSE: 0.1871658

RMSE: 0.4326267

LogLoss: 0.5524149

Mean Per-Class Error: 0.3196844

AUC: 0.7890978

pr_auc: 0.8047692

Gini: 0.5781957

Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:

0 1 Error Rate

0 2301 2404 0.510946 =2404/4705

1 680 4615 0.128423 =680/5295

Totals 2981 7019 0.308400 =3084/10000

Maximum Metrics: Maximum metrics at their respective thresholds

metric threshold value idx

1 max f1 0.350203 0.749553 274

2 max f2 0.179339 0.860131 348

3 max f0point5 0.614417 0.738955 156

4 max accuracy 0.518482 0.713400 198

5 max precision 0.948036 1.000000 0

6 max recall 0.067843 1.000000 393

7 max specificity 0.948036 1.000000 0

8 max absolute_mcc 0.535747 0.425420 190

9 max min_per_class_accuracy 0.531468 0.712181 192

10 max mean_per_class_accuracy 0.535747 0.713052 190

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
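
The cross-validation metrics in this output are the same values that appear in the first row of the leaderboard (AUC 0.7890978, mean per-class error 0.3196844), while the training metrics are far more optimistic (AUC 0.9527414). As a small worked check, the sketch below recomputes the error rates from the cross-validation confusion matrix counts printed above, and the Gini value from the AUC using the relation Gini = 2 * AUC - 1, which matches the numbers in this output. The variable names are my own.

```r
# Cross-validation confusion matrix counts copied from the output above
# (rows: actual class, columns: predicted class)
cm <- matrix(c(2301, 2404,
                680, 4615),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error: share of each actual class that was misclassified
per_class_error <- 1 - diag(cm) / rowSums(cm)

# Mean per-class error: unweighted average of the two class error rates
mean_per_class_error <- mean(per_class_error)

# Overall error rate across all 10,000 rows
overall_error <- 1 - sum(diag(cm)) / sum(cm)

# Gini computed from the cross-validated AUC reported above
cv_auc <- 0.7890978
gini <- 2 * cv_auc - 1

print(round(per_class_error, 6))
print(round(mean_per_class_error, 7))
print(round(overall_error, 4))
print(round(gini, 6))
```

The results should agree with the Error column, the Mean Per-Class Error line, and the Gini line of the cross-validation section, up to rounding in the last digit.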

This best model is stored in aml@leader. To select a different model, use h2o.getModel(). To save a model, use h2o.saveModel(); you also need to specify the path where it should be written.

# save the model (here `model` is assumed to be the AutoML leader)

model <- aml@leader

model_path <- h2o.saveModel(object = model, path = getwd(), force = TRUE)

# load the model

saved_model <- h2o.loadModel(model_path)
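
As mentioned above, models other than the leader can be retrieved with h2o.getModel(). The following is a minimal sketch of how that might look, together with a check of a model's AUC on the held-out test frame. The helper names get_ranked_model and test_auc are hypothetical and not part of H2O. They assume that `aml` is the object returned by h2o.automl() and that `test` is the H2OFrame imported earlier.

```r
library(h2o)

# Hypothetical helper: fetch the model at a given leaderboard rank.
# `aml` is the H2OAutoML object returned by h2o.automl().
get_ranked_model <- function(aml, rank = 1) {
  # Pull the model_id column of the leaderboard into a character vector
  model_ids <- as.character(as.data.frame(aml@leaderboard$model_id)[, 1])
  stopifnot(rank >= 1, rank <= length(model_ids))
  h2o.getModel(model_ids[rank])
}

# Hypothetical helper: AUC of a model on a held-out H2OFrame
test_auc <- function(model, test) {
  perf <- h2o.performance(model, newdata = test)
  h2o.auc(perf)
}

# Example usage, assuming `aml` and `test` from the session above:
# gbm <- get_ranked_model(aml, rank = 3)  # third row of the leaderboard
# test_auc(gbm, test)
```

Because the leaderboard in this post is ranked on cross-validation metrics, evaluating on the separate test frame gives an additional check of how well a chosen model generalizes.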

References

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/save-and-load-model.html
