
Random Forest : training on the iris dataset and visualizing the trees

๊น€์‹ฌ์Šจ 2025. 6. 25. 08:25

1. ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ 

์—ฌ๋Ÿฌ ๊ฐœ์˜ ์˜์‚ฌ๊ฒฐ์ • ๋‚˜๋ฌด ๋ชจ๋ธ์„ ์ƒ์„ฑํ•˜๊ณ , ๊ฐ ๋ชจ๋ธ์˜ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ์ข…ํ•ฉ -> ์ตœ์ข… ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๋Š” ์•™์ƒ๋ธ” ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜ 

์ฃผ๋กœ ๋ถ„๋ฅ˜, ํšŒ๊ท€ ๋ฌธ์ œ์— ์“ฐ์ž„, ๊ฐœ๋ณ„ ๋‚˜๋ฌด์˜ ์•ฝ์ ์„ ๋ณด์™„, ์ „์ฒด ๋ชจ๋ธ์˜ ์ •ํ™•๋„์™€ ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๋†’์ž„ 

< ๋ชฉ์  >

- ์—ฌ๋Ÿฌ ๊ฒฐ์ • ๋‚˜๋ฌด๋ฅผ ์ƒ์„ฑํ•˜์—ฌ ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€ํ•˜๋Š” ๋ฐฉ๋ฒ• ๋ฐฐ์›€ใ„ด 

- ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ ๋ชจ๋ธ ๋‚ด๋ถ€ ๊ตฌ์กฐ (๊ฒฐ์ • ๋‚˜๋ฌด)๋ฅผ ์‹œ๊ฐํ™”ํ•˜์—ฌ ์ดํ•ด๋ ฅ์„ ๋†’์ด๊ณ ์ž ํ•จ 

 

2. ํ•„์ˆ˜ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

pandas : handles data frames; used for data analysis and preprocessing

numpy : the core library for multidimensional array operations and numerical computation

matplotlib.pyplot : the go-to library for data visualization

sklearn.datasets.load_iris : a classic dataset for machine learning practice

sklearn.ensemble.RandomForestClassifier : the random forest classification model

sklearn.tree.plot_tree : visualizes a decision tree

 

< ๋ถ„์„ ํ๋ฆ„ >

๋ฐ์ดํ„ฐ ๋กœ๋“œ 

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

๋ชจ๋ธ ํ•™์Šต (RandomForestClassifier)

- n_estimators : ํŠธ๋ฆฌ์˜ ๊ฐœ์ˆ˜ ์„ค์ • 

max_depth : ๊ฐ ๊ฒฐ์ • ๋‚˜๋ฌด์˜ ์ตœ๋Œ€ ๊นŠ์ด ์ œํ•œ 

๋ชจ๋ธ ์‹œ๊ฐํ™” 

3. ์ฝ”๋“œ ๋œฏ์–ด๋จน๊ธฐ 

# ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์ž„ํฌํŠธ
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree

# ๋ฐ์ดํ„ฐ ๋กœ๋“œ
iris = load_iris()

# ๋…๋ฆฝ๋ณ€์ˆ˜(X), ์ข…์†๋ณ€์ˆ˜(y) ์„ค์ •
X = iris.data[:, 2:4] # ๊ฝƒ์žŽ ๊ธธ์ด์™€ ๋„“์ด๋งŒ ์‚ฌ์šฉ
y = iris.target # ๋ถ“๊ฝƒ ํ’ˆ์ข…(0, 1, 2)

# ๋ชจ๋ธ ์ƒ์„ฑ (๋žœ๋คํฌ๋ ˆ์ŠคํŠธ)
rf = RandomForestClassifier(
    n_estimators=3,     # 3๊ฐœ์˜ ๊ฒฐ์ • ๋‚˜๋ฌด๋ฅผ ์ƒ์„ฑ
    max_depth=3,        # ๊ฐ ๋‚˜๋ฌด์˜ ์ตœ๋Œ€ ๊นŠ์ด๋Š” 3
    random_state=42     # ๊ฒฐ๊ณผ ์žฌํ˜„์„ฑ์„ ์œ„ํ•ด ๋‚œ์ˆ˜ ์„ค์ •
)

# ๋ชจ๋ธ ํ•™์Šต
rf.fit(X, y)

# ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ ์„ค์ •
plt.figure(figsize=(20, 5))

# ๊ฐ ๊ฒฐ์ • ๋‚˜๋ฌด ์‹œ๊ฐํ™”
for i in range(len(rf.estimators_)):
    plt.subplot(1, len(rf.estimators_), i+1) # subplot์„ ๊ฐ€๋กœ๋กœ ๋‚˜์—ด
    plot_tree(
        rf.estimators_[i],                   # ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ์˜ ๊ฐ ํŠธ๋ฆฌ
        feature_names=iris.feature_names[2:4],
        class_names=iris.target_names,
        filled=True,
        rounded=True
    )
    plt.title(f'Tree {i+1}') # ๊ฐ subplot ์ œ๋ชฉ
plt.tight_layout() # ์„œ๋ธŒํ”Œ๋กฏ ๊ฐ„๊ฒฉ ์ž๋™์กฐ์ •
plt.show()
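
A quick side check of the aggregation idea from section 1: as far as I know, scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities (soft voting) rather than taking a hard majority vote. A minimal sketch, reusing the rf model fitted above:

# Average the per-tree class probabilities and compare with the forest's own output
tree_probas = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
forest_probas = rf.predict_proba(X)

print(np.allclose(tree_probas, forest_probas))  # True: the forest's probabilities are the average over its trees
print(rf.predict(X[:5]))                        # final class = argmax of the averaged probabilities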

 

4. ๋ชจ๋ธ ๋ถ„์„ ์ธ์‚ฌ์ดํŠธ 

1.ํŠธ๋ฆฌ๋ณ„ ์ฃผ์š” ํŠน์ง•, ํŒจํ„ด 

<Tree1>

์ฒซ ๋ฒˆ์งธ ๋…ธ๋“œ๋Š” ๊ฝƒ์žŽ ๋„“์ด (petal width)๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ถ„ํ• , ๊ฐ’์ด 0.8cm ์ดํ•˜์ด๋ฉด setosa๋กœ ๋ถ„๋ฅ˜๋จ 

์ดํ›„ ๊ฝƒ์žŽ ๊ธธ์ด (petal length)์™€ ๊ฝƒ์žŽ ๋„“์ด๋ฅผ ๊ธฐ์ค€์œผ๋กœ versicolor์™€ virginica๋ฅผ ๊ตฌ๋ถ„ํ•จ 

์ง€๋‹ˆ์ง€์ˆ˜ ๋‚ฎ์€ ๋…ธ๋“œ๊ฐ€ ๋ณด์ž„ -> ๋ช…ํ™•ํ•œ ํด๋ž˜์Šค ๊ตฌ๋ถ„์ด ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ์Œ 

 

<Tree 2, 3>

Tree 2 -> slight class mixing remains at a node with a Gini index of 0.08, but overall it classifies stably with clear split criteria.

Tree 3 -> Gini index of 0.0, the leaves contain exactly one class (very high confidence).

 

+) ์ง€๋‹ˆ์ง€์ˆ˜๋Š” ํ•ด๋‹น ๋…ธ๋“œ์—์„œ ํด๋ž˜์Šค๊ฐ€ ์–ผ๋งˆ๋‚˜ ์„ž์—ฌ์žˆ๋Š”์ง€ ๋‚˜ํƒ€๋ƒ„. (0์— ๊ฐ€๊นŒ์šธ ์ˆ˜๋ก ๋…ธ๋“œ์˜ ์ˆœ๋„๊ฐ€ ๋†’์•„์ง„๋‹ค๋Š” ๊ฒƒ์„ ์˜๋ฏธ)

ํ•˜์œ„ ๋…ธ๋“œ๋กœ ๊ฐˆ์ˆ˜๋ก ์ง€๋‹ˆ์ง€์ˆ˜๊ฐ€ ๋‚ฎ์•„์ ธ ๋ช…ํ™•ํ•œ ๋ถ„๋ฅ˜๊ฐ€ ์ด๋ฃจ์–ด์ง

 

 

5. ๋ฐฐ์šด ์  

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์˜ ์ ์šฉ?

์‹ค์ œ ์‹ค๋ฌด์—์„œ๋Š” 3๊ฐœ์˜ ํŠธ๋ฆฌ๋กœ๋Š” ์ถฉ๋ถ„ํ•˜์ง€ ์•Š์Œ. 100๊ฐœ ์ด์ƒ์˜ ํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด ์•ˆ์ •์„ฑ ํ™•๋ณด 

max_depth๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์™€ ๋ณต์žก์„ฑ์— ๋”ฐ๋ผ ์กฐ์ •๋จ. ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•ด 5~20 ์‚ฌ์ด์—์„œ ์ตœ์ ์˜ ๊ฐ’์„ ์ฐพ๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์  

 

์ž๋™ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ ๋„์ž…ํ•ด๋ณด๊ธฐ ( GridSearchCV, RandomizedSearchCV )

from sklearn.model_selection import GridSearchCV

# Grid of candidate hyperparameter values to try
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10]
}

# Evaluate every combination with 5-fold cross-validation
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_)  # the best parameter combination found
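
A small follow-up, assuming the default refit=True: after fitting, GridSearchCV also exposes the best cross-validated score and a model already refit on the full data with the winning parameters.

print(grid_search.best_score_)         # mean cross-validated accuracy of the best combination
best_rf = grid_search.best_estimator_  # RandomForestClassifier refit on all of X, y with those parameters
print(best_rf.predict(X[:5]))          # predictions from the tuned model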


๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ์กฐ์‹ฌ 

์‹ค์ œ ๋ฐ์ดํ„ฐ์—์„œ๋Š” ํด๋ž˜์Šค ๋ถˆ๊ท ํ˜• (ํ•œ ํด๋ž˜์Šค ๋ฐ์ดํ„ฐ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ๊ฑฐ๋‚˜, ์ ์€ ๊ฒฝ์šฐ) ์ž์ฃผ ๋ฐœ์ƒ 

์ด๋•Œ, class_weight = 'balanced' ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜ SMOTE ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•ด ๋ฐ์ดํ„ฐ ๊ท ํ˜•์„ ๋งž์ถ”๋Š” ๊ฒƒ์ด ํ•„์š” 

 

ํŠน์ง• ์ค‘์š”๋„ (Feature importance) ๊ธฐ๋Šฅ ํ™œ์šฉํ•˜๊ธฐ 

์–ด๋–ค ๋ณ€์ˆ˜๊ฐ€ ์ค‘์š”ํ•œ ์ง€ ๋น ๋ฅด๊ฒŒ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ 

์˜ˆ ) 

importances = rf.feature_importances_              # importance score of each feature
indices = np.argsort(importances)[::-1]            # indices sorted from most to least important
feature_names = np.array(iris.feature_names[2:4])  # petal length, petal width
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=45)  # labels reordered to match the sorted bars
plt.title("Feature Importance")
plt.show()