Data Analysis

DecisionTreeClassifier (decision tree): income level prediction example

김심슨 2025. 6. 21. 14:41

Purpose: predict whether an adult has a high income (>50K) from demographic and occupational information

Model: DecisionTreeClassifier (decision tree)

Data: adult.csv

Preprocessing: drop unneeded variables, one-hot encode categorical variables

Evaluation: confusion matrix plus accuracy, precision, recall, and F1 score, so that the model's predictions can be assessed accurately and in detail

The confusion matrix shows exactly which kinds of errors the model makes, and the performance metrics summarize its quality from several complementary angles.

 

+) Confusion matrix: a table showing how well a model predicted in a classification problem

It compares predictions against actual values and lays out what was right and what was wrong in an easy-to-read form.

It gives a clear picture of performance when there are more than two classes (multiclass classification) or when the classes are imbalanced.

<Advantages>

- Shows exactly where the model went wrong

- Enables finer-grained evaluation beyond accuracy (precision, recall, F1-score, and so on)

- Distinguishes FP (false positives) from FN (false negatives)

[Figure: confusion matrix example (binary classification)]
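As a minimal sketch of how such a table is tallied, the snippet below builds a 2x2 confusion matrix by hand with pandas on six made-up labels. The data and the names `actual` and `predicted` are hypothetical and are not taken from adult.csv.

```python
import pandas as pd

# Hypothetical toy labels, only for illustration
actual = pd.Series(['high', 'high', 'high', 'low', 'low', 'low'], name='actual')
predicted = pd.Series(['high', 'low', 'low', 'low', 'low', 'high'], name='predicted')

# Rows are actual classes, columns are predicted classes
table = pd.crosstab(actual, predicted)
print(table)

tp = table.loc['high', 'high']  # actual high, predicted high
fn = table.loc['high', 'low']   # actual high, predicted low
fp = table.loc['low', 'high']   # actual low, predicted high
tn = table.loc['low', 'low']    # actual low, predicted low
```

Treating high as the positive class, the four cells are the TP, FN, FP, and TN counts that the metrics are built from.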

 

+) Performance metrics: quantitative measures of how accurately a model predicts

 

  • Accuracy: the share of all predictions that are correct
  • Precision: of the cases the model predicted positive, the share that are actually positive
  • Recall: of the cases that are actually positive, the share the model predicted positive
  • F1-score: the harmonic mean of precision and recall, which takes both into account at once

 

 

< Overall flow: steps of the data analysis >

1. Prepare and read the data

2. Preprocess (convert the target, encode, handle missing values, etc.)

3. Split the data (train/test)

4. Choose and train the model (decision tree)

5. Predict

6. Measure model performance with evaluation metrics

7. Inspect the model's internal structure with a visualization

 

< Required libraries >

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree # decision tree
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay  # confusion matrix and its plot
import sklearn.metrics as metrics # performance metrics

 

# Load the data

df = pd.read_csv('asset/adult.csv')
df.info()

--- Output ---

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education_num   48842 non-null  int64 
 5   marital_status  48842 non-null  object
 6   occupation      48842 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital_gain    48842 non-null  int64 
 11  capital_loss    48842 non-null  int64 
 12  hours_per_week  48842 non-null  int64 
 13  native_country  48842 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB

# Preprocess the data

df['income'] = np.where(df['income']=='>50K', 'high', 'low')
-> The income column holds string categories, so recode it for binary classification
-> 'high' if annual income exceeds 50,000 dollars, otherwise 'low'

print(df['income'].value_counts(normalize=True)) # print the proportion of each category

# low     0.760718
# high    0.239282

df = df.drop(columns='fnlwgt')
-> fnlwgt is a demographic sampling weight with no direct meaning for prediction, so drop it

 

# Convert string-typed variables to numeric with one-hot encoding (values become 1 or 0)

target = df['income'] # one variable (target variable = dependent variable)
df = df.drop(columns='income') # the 13 remaining variables (predictors = independent variables)
df = pd.get_dummies(df) # one-hot encoding
df['income'] = target

-> One-hot encoding turns string (categorical) variables into numeric vectors
The target income is what we predict, so it is set aside first and joined back afterwards
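To make the effect of `pd.get_dummies` concrete, here is a small sketch on a hypothetical three-row frame (the frame `toy` and its values are invented for illustration). Numeric columns pass through unchanged, while each category of a string column becomes its own indicator column.

```python
import pandas as pd

# Hypothetical three-row frame with one numeric and one string column
toy = pd.DataFrame({
    'age': [25, 40, 33],
    'sex': ['Male', 'Female', 'Male'],
})

encoded = pd.get_dummies(toy)

# 'age' is kept as-is; 'sex' is expanded into one column per category
print(list(encoded.columns))
print(encoded)
```

Depending on the pandas version the indicator columns are shown as 0/1 or as False/True; a decision tree handles both the same way.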

 

# Split the data
df_train, df_test = train_test_split(
    df, test_size=0.3, stratify=df['income'], random_state=1234
)

-> Split the full data into 70% training / 30% test
The stratify option keeps the high/low class ratio the same in both sets
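The effect of `stratify` can be checked on synthetic labels. The sketch below assumes a made-up frame `toy` with roughly the same 76/24 imbalance as the income column and compares the class shares in the two resulting sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the same imbalance as the income column
labels = pd.Series(['low'] * 760 + ['high'] * 240, name='income')
toy = pd.DataFrame({'x': range(len(labels)), 'income': labels})

toy_train, toy_test = train_test_split(
    toy, test_size=0.3, stratify=toy['income'], random_state=1234
)

# Class shares in each split; both should stay close to 0.76 / 0.24
train_share = toy_train['income'].value_counts(normalize=True)
test_share = toy_test['income'].value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without `stratify`, the split is purely random and the share of the minority class can drift between the training and test sets, which matters more the smaller or more imbalanced the data is.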

 

# Create and train the model

clf = tree.DecisionTreeClassifier(random_state=1234, max_depth=3)

# Create the decision tree (maximum depth limited to 3)
# The deeper the tree, the more complex it gets and the higher the risk of overfitting

train_x = df_train.drop(columns='income')
train_y = df_train['income']

model = clf.fit(X=train_x, y=train_y)
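The remark about depth and overfitting can be explored with a rough experiment. The sketch below uses synthetic data from `make_classification` (not the adult data) and compares training and test accuracy for a few `max_depth` values; the dataset parameters are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (not the adult data)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=1234
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1234
)

scores = {}
for depth in [3, 10, None]:  # None lets the tree grow until the leaves are pure
    clf_d = DecisionTreeClassifier(max_depth=depth, random_state=1234)
    clf_d.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth
    scores[depth] = (clf_d.score(X_train, y_train), clf_d.score(X_test, y_test))
    print(depth, scores[depth])
```

On noisy data like this, the unrestricted tree fits the training split perfectly while its test score typically lags behind; the exact numbers depend on the data, so the output is best read as a trend rather than a benchmark.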

 

# Visualize the decision tree

# Figure settings
plt.rcParams.update({
    'figure.dpi' : 100, # figure resolution
    'figure.figsize' : [12, 8] # figure size
})

# Decision tree plot
tree.plot_tree(
    model, # fitted model
    feature_names = list(train_x.columns), # predictor names (as a plain list)
    class_names = ['high', 'low'], # target class names (in alphabetical order)
    proportion = True, # show class proportions instead of counts
    filled = True, # fill nodes with color
    rounded = True, # round the node corners
    impurity = False, # whether to show impurity
    label = 'root', # show the text labels only in the root node
    fontsize = 10 # font size
)

plt.show()

[Figure: decision tree]

< Understanding the decision tree plot >

The data holds several pieces of information about each person (marital status, age, education level, capital gain, and so on), and the tree shows how to use that information to separate high-income (high) people from low-income (low) people.

=> In other words, the tree decides between high and low by asking a series of questions.

 

* Node: each rectangular box

- Condition (question): the criterion used to split the data (married? capital gain? etc.)

- samples: the share of the data that reaches this node

- value: [share of high, share of low] -> the proportions of high-income and low-income people in the node

- class: the majority class in the node (low or high)

- Color: blue shades (mostly low income) | orange shades (mostly high income)

 

* Let's actually read it!

1. First question: is the marital-status feature 0.5 or less? = is this person unmarried?

Unmarried people go left (True) and married people go right (False). (With one-hot encoding, married is 1 and unmarried is 0, so a value of 0.5 or less means unmarried.)

This question comes first because marital status separated people's income levels most clearly, that is, it was the split that reduced impurity the most.

 

2. Left branch (True, unmarried people)

Among unmarried people: is capital gain (capital_gain) 7073.5 or less?

This splits them into two groups again:

Small capital gain (left) -> mostly low income (low)

Large capital gain (right) -> split again by asking whether age is 20 or less.

- Younger goes left, mostly low income (low)

- Older goes right, mostly high income (high)

=> Unmarried + large capital gain + older than 20 means a very high chance of high income.

 

3. Right branch (married people)

Married people are split again by asking whether education level is 12.5 or less.

- Lower education (left)

Split again by asking whether capital gain is 5095.5 or less:

a small capital gain means mostly low income, a large one means high income

- Higher education (right)

The same capital-gain question applies

Married + higher education + large capital gain means a high chance of high income.

 

4. What the percentages mean

100% in the top node stands for the full training data,

and the share shrinks as the data is split on the way down.

The numbers getting smaller from node to node show the data being divided into ever smaller groups.
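The "0.5 or less" question in the root node follows from how a tree splits a 0/1 column: the threshold is placed midway between the two observed values. A minimal sketch, assuming a hypothetical one-column frame `toy_x` with a made-up `married` indicator, shows this and prints the learned rule as text with `export_text`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: one one-hot style column and the income label
toy_x = pd.DataFrame({'married': [0, 0, 0, 1, 1, 1, 0, 1]})
toy_y = ['low', 'low', 'low', 'high', 'high', 'high', 'low', 'high']

stump = DecisionTreeClassifier(max_depth=1, random_state=1234).fit(toy_x, toy_y)

# The root splits at the midpoint between the two observed values, 0 and 1
print(stump.tree_.threshold[0])

# Text version of the same rule that plot_tree would draw as boxes
print(export_text(stump, feature_names=['married']))
```

The text rendering is handy when the plotted tree is too small to read, since each line corresponds to one question or leaf in the diagram.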

 

Extract the predictors and target variable for prediction, then predict

test_x = df_test.drop(columns='income')
test_y = df_test['income']

# Predict
df_test['pred'] = model.predict(test_x)
#print(df_test)

--- Output ---
       age  education_num  ...  income  pred
11712   58             10  ...     low   low
24768   39             10  ...     low   low
26758   31              4  ...     low   low
14295   23              9  ...     low   low
3683    24              9  ...     low   low
...    ...            ...  ...     ...   ...
11985   24             13  ...     low   low
48445   35             13  ...    high  high
19639   41              9  ...    high   low
21606   29              4  ...     low   low
3822    31             13  ...     low   low

[14653 rows x 109 columns]

df_test['pred'] = model.predict(test_x): applies the trained model (the decision tree) to test_x and returns a predicted label for each sample

The returned predictions are added to df_test as a new column named pred

Keeping the predictions alongside the original df_test makes it easy to spot misclassified cases and to run further analysis
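One way to pull out the misclassified cases once actual and predicted labels sit side by side is a simple boolean filter. The frame `toy_test` below is a hypothetical stand-in for df_test with invented rows.

```python
import pandas as pd

# Hypothetical stand-in for df_test with actual and predicted labels
toy_test = pd.DataFrame({
    'age': [58, 39, 41, 29],
    'income': ['low', 'low', 'high', 'low'],
    'pred': ['low', 'low', 'low', 'high'],
})

# Keep only the rows where the prediction disagrees with the actual label
wrong = toy_test[toy_test['income'] != toy_test['pred']]
print(wrong)

# Count each kind of error (actual -> predicted)
print(wrong.groupby(['income', 'pred']).size())
```

The same filter applied to the real df_test would list every high earner predicted as low and every low earner predicted as high, ready for closer inspection.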

# Evaluate prediction performance

# Confusion matrix
conf_mat = confusion_matrix(
    y_true = df_test['income'], # actual values
    y_pred = df_test['pred'], # predicted values
    labels = ['high', 'low'] # labels (class order, alphabetical)
)
print(conf_mat)

# Reset figure settings: if options were changed earlier with plt.rcParams.update() etc., reset them so they do not affect the heatmap
plt.rcdefaults()

# Show the confusion matrix as a heatmap
p = ConfusionMatrixDisplay(
    confusion_matrix = conf_mat, # matrix data
    display_labels = ('high', 'low') # target class names
)
p.plot(cmap = 'Blues') # colormap
plt.show()

The confusion matrix is essential for seeing where the model mostly goes wrong (FN vs FP).

Shown as a heatmap, the share of correct and incorrect classifications can be read off intuitively from the color intensity as well as from the numbers.

The order of labels and display_labels must match to avoid mixing up the axes.
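To see why the label order matters, the sketch below computes the confusion matrix twice on the same hypothetical toy labels, once with 'high' first and once with 'low' first. The counts are identical, but they land in different cells.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels
y_true = ['high', 'high', 'low', 'low', 'low']
y_pred = ['high', 'low', 'low', 'low', 'high']

# Rows are actual classes and columns are predicted classes, both in `labels` order
m_high_first = confusion_matrix(y_true, y_pred, labels=['high', 'low'])
m_low_first = confusion_matrix(y_true, y_pred, labels=['low', 'high'])

print(m_high_first)  # 'high' occupies the first row and column
print(m_low_first)   # same counts, rearranged with 'low' first
```

If display_labels were given in a different order from labels, the heatmap would show the right numbers under the wrong class names.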

Reading the heatmap

1. True Positive (TP): 1801

Actual high correctly predicted as high

 

2. False Negative (FN): 1705

Actual high wrongly predicted as low -> missed high earners

 

3. False Positive (FP): 582

Actual low wrongly predicted as high -> low earners misclassified as high income

 

4. True Negative (TN): 10565

Actual low correctly predicted as low
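The four counts above are enough to compute every metric by hand. As a quick check, this sketch plugs them into the standard formulas, treating high as the positive class.

```python
# Counts read off the heatmap: TP, FN, FP, TN with 'high' as the positive class
tp, fn, fp, tn = 1801, 1705, 582, 10565

accuracy = (tp + tn) / (tp + tn + fp + fn)          # correct predictions over all predictions
precision = tp / (tp + fp)                          # correct 'high' over all predicted 'high'
recall = tp / (tp + fn)                             # correct 'high' over all actual 'high'
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

The results should agree with what the scikit-learn metric functions return for the same predictions, which makes this a useful sanity check on the label order.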

< Performance metrics >

1. Accuracy

: how many answers did the model get right?

(Very intuitive, but it can be misleading with imbalanced classes, so take care)

# Accuracy
acc = metrics.accuracy_score(
    y_true = df_test['income'], # actual values
    y_pred = df_test['pred'] # predicted values
)
print(acc) # 0.8439227461953184
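To see how accuracy can mislead on imbalanced classes, consider a hypothetical baseline that always predicts 'low'. The sketch below uses the test-set class counts implied by the confusion matrix (TN + FP actual low, TP + FN actual high).

```python
# Test-set class counts implied by the confusion matrix
n_low = 10565 + 582   # TN + FP: all actual 'low'
n_high = 1801 + 1705  # TP + FN: all actual 'high'

# A hypothetical baseline that always predicts 'low'
baseline_accuracy = n_low / (n_low + n_high)  # every 'low' is right, every 'high' is wrong
baseline_recall_high = 0 / n_high             # no high earner is ever found

print(round(baseline_accuracy, 4), baseline_recall_high)
```

Such a baseline reaches roughly 76% accuracy without learning anything, so the tree's accuracy of about 84% is best judged against that figure and together with precision and recall.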

 

2. Precision

: of the cases the model predicted as high, the share that are actually high

: the key metric when you want to reduce false alarms (FP)

: typically used to judge how trustworthy the positive-class predictions are

- Very important in credit scoring and recommender systems

- If precision is low, the model's high predictions are hard to trust

# Precision
pre = metrics.precision_score(
    y_true = df_test['income'], # actual values
    y_pred = df_test['pred'], # predicted values
    pos_label = 'high' # class of interest
)
print(pre) #0.7557700377675199

 

3. Recall

: of the actual high class, the share the model caught without missing

: that is, the key metric for reducing false negatives

- In sensitive fields such as cancer diagnosis and fraud detection, an FN is very dangerous, so recall gets particular weight

- Low recall means important cases are being missed, so the model needs adjusting

rec = metrics.recall_score(
    y_true = df_test['income'], # actual values
    y_pred = df_test['pred'], # predicted values
    pos_label = 'high' # class of interest
)
print(rec) #0.5136908157444381

 

4. F1-score

: the harmonic mean of precision and recall

A good single number for judging the balance between precision and recall

- Used when both precision and recall matter

f1 = metrics.f1_score(
    y_true = df_test['income'], # actual values
    y_pred = df_test['pred'], # predicted values
    pos_label = 'high' # class of interest
)
print(f1) #0.6116488368143997
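The reason F1 uses a harmonic rather than an arithmetic mean is easiest to see with lopsided values. In the sketch below, `f1_from` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical lopsided model: very precise but finds few positives
p, r = 1.0, 0.1
print((p + r) / 2)    # arithmetic mean
print(f1_from(p, r))  # harmonic mean, pulled toward the weaker value

# The precision and recall reported for the tree
print(f1_from(0.7557700377675199, 0.5136908157444381))
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot earn a high F1 by excelling at only one of them.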

 

 

<Understanding precision and recall a bit more>

Precision: how accurate the positive predictions are

Recall: missing a case means big trouble

 

* Why precision and recall matter, case 1 (financial fraud detection)

If a financial institution misses real fraud, the damage is large (so FN can be seen as very dangerous).

Recall is usually prioritized => anything with even a slight chance of being fraud gets flagged first.

But precision cannot be ignored either. Why?

If precision is too low, legitimate transactions keep being flagged as fraud, customers are inconvenienced, and they get angry at the bank staff. (They really do get angry a lot.)

Therefore:

Stage 1 maximizes recall to catch every suspicious transaction,

Stage 2 has a person review the flagged cases, which raises precision by clearing the legitimate transactions.
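One common way to favor recall in the first stage is to lower the probability threshold at which a case is flagged. This is a rough sketch of that idea on synthetic imbalanced data, not a description of how any real fraud system works; the dataset, the thresholds 0.5 and 0.2, and the name `clf_t` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for fraud (class 1 is the rare positive)
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.9, 0.1], random_state=1234
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1234
)

clf_t = DecisionTreeClassifier(max_depth=3, random_state=1234).fit(X_train, y_train)
proba_pos = clf_t.predict_proba(X_test)[:, 1]  # estimated probability of class 1

results = {}
for threshold in [0.5, 0.2]:
    # Flag a case when its estimated probability clears the threshold
    flagged = (proba_pos >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, flagged, zero_division=0),
        recall_score(y_test, flagged, zero_division=0),
    )
    print(threshold, results[threshold])
```

Lowering the threshold can only flag more cases, so recall never drops; precision usually falls as a result, which is the cost the second, human review stage is meant to absorb.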

 

* Why precision and recall matter, case 2 (airport security checkpoint)

Precision is the share of people who set off an alarm who are actually dangerous.

What if precision is low?

Innocent people keep getting stopped, screening takes longer, and complaints grow.

Recall is the share of actually dangerous people that the system detects.

If recall is low, a truly dangerous person can get through, which can lead to a major incident.

(Perhaps drug-sniffing dogs play a big part in raising recall)

Therefore:

An airport must never miss a real threat (FN), since lives are at stake, so recall is the top priority.

A secondary inspection then quickly clears the people who are not actually a threat, which raises precision.

 

* Conversely, which fields care more about precision?

Recall will usually come first to minimize risk, but I suddenly wondered: are there fields or tasks that put precision first?

-> If the harm from a false positive (FP) is large, precision would take priority, right?

 

< Cases where precision matters more >

- Email spam filtering: if an important email lands in spam, it causes a business loss

- High-cost, high-stakes medical procedures: the first screening flags anything even slightly suspicious, but a second stage that is costly and risky, such as surgery, needs high precision to avoid acting on a misdiagnosis

- Credit scoring: wrongly rating someone with poor credit as good leads directly to losses for the financial institution

- Legal verdicts: convicting an innocent person brings human rights and ethical problems and a media backlash