RoboTech AI

Convergence_RoboTech AI Autonomous Driving Robot Developer Course - 26/03/19

steezer 2026. 3. 19. 18:30

https://www.kaggle.com/

 


 

Feature Engineering

 

A machine learning model works in these stages:

1. Train

2. Evaluate

3. Predict

 

Training well is the most important part.

For training to go well, you need good data with the right features.

(For fish) length, height, and thickness together are much better than length alone.

More features? -> feature engineering

Feature engineering gives you more features -> but is more always better? No

-> Because of overfitting, the model becomes accurate only on the training data

 

The problem is overfitting -> how do we keep overfitting from happening?

Apply regularization

 

Linear regression models with regularization:

- Ridge: applies the penalty based on the squared coefficients <= L2 regularization

- Lasso: applies the penalty based on the absolute values of the coefficients <= L1 regularization
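A minimal sketch of the two penalties on synthetic data (the data and alpha values here are made up): Ridge (L2) shrinks all coefficients, while Lasso (L1) can drive irrelevant ones exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only feature 0 matters

# alpha is the regularization strength (a hyperparameter): bigger = stronger penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: squared-coefficient penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: absolute-value penalty

print(np.round(ridge.coef_, 3))  # all five shrunk, but none exactly zero
print(np.round(lasso.coef_, 3))  # irrelevant coefficients become exactly 0.0
```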

 

Hyperparameter: a parameter the model cannot learn from the data and a person has to supply
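To make the distinction concrete (a minimal sketch on the iris dataset): constructor arguments like C and max_iter are hyperparameters you choose, while attributes such as coef_ are parameters the model learns.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hyperparameters: set by a person before training
model = LogisticRegression(C=1.0, max_iter=1000)

# Parameters: learned from the data during fit()
model.fit(X, y)
print(model.coef_.shape)       # (3, 4): one row of weights per class
print(model.intercept_.shape)  # (3,)
```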

 

First machine learning mission

# 1. (Optional) Explore the data with visualization, using a pandas DataFrame
# 2. (Optional) Preprocess the data - scaling, feature engineering
# 3. Split into training and test sets (any random_state you like)
# 4. Pick a model - find and try something better than KNeighbors
# 5. Train, evaluate, and predict
# 6. (Optional) If you see overfitting or underfitting, fix it

 

Data loading and initial exploration

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

cancer.data          # the actual feature values
cancer.feature_names # the name of each feature, as a string
cancer.target        # target values... 0 = malignant, 1 = benign

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df_cancer['target'] = cancer.target

print("DataFrame head:")
print(df_cancer.head())

print("\nDataFrame info:")
df_cancer.info()

print("\nDataFrame descriptive statistics:")
print(df_cancer.describe())

plt.figure(figsize=(6, 4))
sns.countplot(x='target', data=df_cancer)
plt.title('Distribution of Target Classes')
plt.xlabel('Target (0: Malignant, 1: Benign)')
plt.ylabel('Count')
plt.show()
DataFrame head:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
0                 0.07871  ...          17.33           184.60      2019.0   
1                 0.05667  ...          23.41           158.80      1956.0   
2                 0.05999  ...          25.53           152.50      1709.0   
3                 0.09744  ...          26.50            98.87       567.7   
4                 0.05883  ...          16.67           152.20      1575.0   

   worst smoothness  worst compactness  worst concavity  worst concave points  \
0            0.1622             0.6656           0.7119                0.2654   
1            0.1238             0.1866           0.2416                0.1860   
2            0.1444             0.4245           0.4504                0.2430   
3            0.2098             0.8663           0.6869                0.2575   
4            0.1374             0.2050           0.4000                0.1625   

   worst symmetry  worst fractal dimension  target  
0          0.4601                  0.11890       0  
1          0.2750                  0.08902       0  
2          0.3613                  0.08758       0  
3          0.6638                  0.17300       0  
4          0.2364                  0.07678       0  

[5 rows x 31 columns]

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         569 non-null    float64
 15  compactness error        569 non-null    float64
 16  concavity error          569 non-null    float64
 17  concave points error     569 non-null    float64
 18  symmetry error           569 non-null    float64
 19  fractal dimension error  569 non-null    float64
 20  worst radius             569 non-null    float64
 21  worst texture            569 non-null    float64
 22  worst perimeter          569 non-null    float64
 23  worst area               569 non-null    float64
 24  worst smoothness         569 non-null    float64
 25  worst compactness        569 non-null    float64
 26  worst concavity          569 non-null    float64
 27  worst concave points     569 non-null    float64
 28  worst symmetry           569 non-null    float64
 29  worst fractal dimension  569 non-null    float64
 30  target                   569 non-null    int64  
dtypes: float64(30), int64(1)
memory usage: 137.9 KB

DataFrame descriptive statistics:
       mean radius  mean texture  mean perimeter    mean area  \
count   569.000000    569.000000      569.000000   569.000000   
mean     14.127292     19.289649       91.969033   654.889104   
std       3.524049      4.301036       24.298981   351.914129   
min       6.981000      9.710000       43.790000   143.500000   
25%      11.700000     16.170000       75.170000   420.300000   
50%      13.370000     18.840000       86.240000   551.100000   
75%      15.780000     21.800000      104.100000   782.700000   
max      28.110000     39.280000      188.500000  2501.000000   

       mean smoothness  mean compactness  mean concavity  mean concave points  \
count       569.000000        569.000000      569.000000           569.000000   
mean          0.096360          0.104341        0.088799             0.048919   
std           0.014064          0.052813        0.079720             0.038803   
min           0.052630          0.019380        0.000000             0.000000   
25%           0.086370          0.064920        0.029560             0.020310   
50%           0.095870          0.092630        0.061540             0.033500   
75%           0.105300          0.130400        0.130700             0.074000   
max           0.163400          0.345400        0.426800             0.201200   

       mean symmetry  mean fractal dimension  ...  worst texture  \
count     569.000000              569.000000  ...     569.000000   
mean        0.181162                0.062798  ...      25.677223   
std         0.027414                0.007060  ...       6.146258   
min         0.106000                0.049960  ...      12.020000   
25%         0.161900                0.057700  ...      21.080000   
50%         0.179200                0.061540  ...      25.410000   
75%         0.195700                0.066120  ...      29.720000   
max         0.304000                0.097440  ...      49.540000   

       worst perimeter   worst area  worst smoothness  worst compactness  \
count       569.000000   569.000000        569.000000         569.000000   
mean        107.261213   880.583128          0.132369           0.254265   
std          33.602542   569.356993          0.022832           0.157336   
min          50.410000   185.200000          0.071170           0.027290   
25%          84.110000   515.300000          0.116600           0.147200   
50%          97.660000   686.500000          0.131300           0.211900   
75%         125.400000  1084.000000          0.146000           0.339100   
max         251.200000  4254.000000          0.222600           1.058000   

       worst concavity  worst concave points  worst symmetry  \
count       569.000000            569.000000      569.000000   
mean          0.272188              0.114606        0.290076   
std           0.208624              0.065732        0.061867   
min           0.000000              0.000000        0.156500   
25%           0.114500              0.064930        0.250400   
50%           0.226700              0.099930        0.282200   
75%           0.382900              0.161400        0.317900   
max           1.252000              0.291000        0.663800   

       worst fractal dimension      target  
count               569.000000  569.000000  
mean                  0.083946    0.627417  
std                   0.018061    0.483918  
min                   0.055040    0.000000  
25%                   0.071460    0.000000  
50%                   0.080040    1.000000  
75%                   0.092080    1.000000  
max                   0.207500    1.000000  

[8 rows x 31 columns]

Data preprocessing

from sklearn.preprocessing import StandardScaler

X = df_cancer.drop('target', axis=1)
y = df_cancer['target']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Original feature shape:", X.shape)
print("Scaled feature shape:", X_scaled.shape)
print("First 5 rows of scaled features:\n", X_scaled[:5])
Original feature shape: (569, 30)
Scaled feature shape: (569, 30)
First 5 rows of scaled features:
 [[ 1.09706398e+00 -2.07333501e+00  1.26993369e+00  9.84374905e-01
   1.56846633e+00  3.28351467e+00  2.65287398e+00  2.53247522e+00
   2.21751501e+00  2.25574689e+00  2.48973393e+00 -5.65265059e-01
   2.83303087e+00  2.48757756e+00 -2.14001647e-01  1.31686157e+00
   7.24026158e-01  6.60819941e-01  1.14875667e+00  9.07083081e-01
   1.88668963e+00 -1.35929347e+00  2.30360062e+00  2.00123749e+00
   1.30768627e+00  2.61666502e+00  2.10952635e+00  2.29607613e+00
   2.75062224e+00  1.93701461e+00]
 [ 1.82982061e+00 -3.53632408e-01  1.68595471e+00  1.90870825e+00
  -8.26962447e-01 -4.87071673e-01 -2.38458552e-02  5.48144156e-01
   1.39236330e-03 -8.68652457e-01  4.99254601e-01 -8.76243603e-01
   2.63326966e-01  7.42401948e-01 -6.05350847e-01 -6.92926270e-01
  -4.40780058e-01  2.60162067e-01 -8.05450380e-01 -9.94437403e-02
   1.80592744e+00 -3.69203222e-01  1.53512599e+00  1.89048899e+00
  -3.75611957e-01 -4.30444219e-01 -1.46748968e-01  1.08708430e+00
  -2.43889668e-01  2.81189987e-01]
 [ 1.57988811e+00  4.56186952e-01  1.56650313e+00  1.55888363e+00
   9.42210440e-01  1.05292554e+00  1.36347845e+00  2.03723076e+00
   9.39684817e-01 -3.98007910e-01  1.22867595e+00 -7.80083377e-01
   8.50928301e-01  1.18133606e+00 -2.97005012e-01  8.14973504e-01
   2.13076435e-01  1.42482747e+00  2.37035535e-01  2.93559404e-01
   1.51187025e+00 -2.39743838e-02  1.34747521e+00  1.45628455e+00
   5.27407405e-01  1.08293217e+00  8.54973944e-01  1.95500035e+00
   1.15225500e+00  2.01391209e-01]
 [-7.68909287e-01  2.53732112e-01 -5.92687167e-01 -7.64463792e-01
   3.28355348e+00  3.40290899e+00  1.91589718e+00  1.45170736e+00
   2.86738293e+00  4.91091929e+00  3.26373441e-01 -1.10409044e-01
   2.86593405e-01 -2.88378148e-01  6.89701660e-01  2.74428041e+00
   8.19518384e-01  1.11500701e+00  4.73268037e+00  2.04751088e+00
  -2.81464464e-01  1.33984094e-01 -2.49939304e-01 -5.50021228e-01
   3.39427470e+00  3.89339743e+00  1.98958826e+00  2.17578601e+00
   6.04604135e+00  4.93501034e+00]
 [ 1.75029663e+00 -1.15181643e+00  1.77657315e+00  1.82622928e+00
   2.80371830e-01  5.39340452e-01  1.37101143e+00  1.42849277e+00
  -9.56046689e-03 -5.62449981e-01  1.27054278e+00 -7.90243702e-01
   1.27318941e+00  1.19035676e+00  1.48306716e+00 -4.85198799e-02
   8.28470780e-01  1.14420474e+00 -3.61092272e-01  4.99328134e-01
   1.29857524e+00 -1.46677038e+00  1.33853946e+00  1.22072425e+00
   2.20556166e-01 -3.13394511e-01  6.13178758e-01  7.29259257e-01
  -8.68352984e-01 -3.97099619e-01]]

 

Splitting into training and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)
X_train shape: (426, 30)
y_train shape: (426,)
X_test shape: (143, 30)
y_test shape: (143,)

 

Model selection and training

from sklearn.ensemble import RandomForestClassifier

# RandomForestClassifier: an ensemble model that combines many decision trees to improve classification performance
# Reduces overfitting and gives high accuracy (expected to outperform k-Nearest Neighbors)
model = RandomForestClassifier(random_state=42)

model.fit(X_train, y_train)

 

Model evaluation and prediction

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
Accuracy: 0.9650
Precision: 0.9667
Recall: 0.9775
F1-score: 0.9721

 

Diagnosing and fixing overfitting/underfitting

y_train_pred = model.predict(X_train)

accuracy_train = accuracy_score(y_train, y_train_pred)
precision_train = precision_score(y_train, y_train_pred)
recall_train = recall_score(y_train, y_train_pred)
f1_train = f1_score(y_train, y_train_pred)

print("\n--- Training Set Performance ---")
print(f"Accuracy: {accuracy_train:.4f}")
print(f"Precision: {precision_train:.4f}")
print(f"Recall: {recall_train:.4f}")
print(f"F1-score: {f1_train:.4f}")

print("\n--- Test Set Performance (for comparison) ---")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
--- Training Set Performance ---
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-score: 1.0000

--- Test Set Performance (for comparison) ---
Accuracy: 0.9650
Precision: 0.9667
Recall: 0.9775
F1-score: 0.9721

 

Managing overfitting

from sklearn.ensemble import RandomForestClassifier

# Instantiate a new RandomForestClassifier with adjusted hyperparameters
# max_depth limits the depth of each tree so it cannot learn overly specific patterns
# min_samples_leaf forces each leaf node to hold a minimum number of samples, making the model more robust
model_tuned = RandomForestClassifier(max_depth=8, min_samples_leaf=5, random_state=42)

# Train the tuned model
model_tuned.fit(X_train, y_train)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_train_pred_tuned = model_tuned.predict(X_train)
y_test_pred_tuned = model_tuned.predict(X_test)

# Evaluate on the training set
accuracy_train_tuned = accuracy_score(y_train, y_train_pred_tuned)
precision_train_tuned = precision_score(y_train, y_train_pred_tuned)
recall_train_tuned = recall_score(y_train, y_train_pred_tuned)
f1_train_tuned = f1_score(y_train, y_train_pred_tuned)

# Evaluate on the test set
accuracy_test_tuned = accuracy_score(y_test, y_test_pred_tuned)
precision_test_tuned = precision_score(y_test, y_test_pred_tuned)
recall_test_tuned = recall_score(y_test, y_test_pred_tuned)
f1_test_tuned = f1_score(y_test, y_test_pred_tuned)

print("--- Tuned Model Training Set Performance ---")
print(f"Accuracy: {accuracy_train_tuned:.4f}")
print(f"Precision: {precision_train_tuned:.4f}")
print(f"Recall: {recall_train_tuned:.4f}")
print(f"F1-score: {f1_train_tuned:.4f}")

print("\n--- Tuned Model Test Set Performance ---")
print(f"Accuracy: {accuracy_test_tuned:.4f}")
print(f"Precision: {precision_test_tuned:.4f}")
print(f"Recall: {recall_test_tuned:.4f}")
print(f"F1-score: {f1_test_tuned:.4f}")
--- Tuned Model Training Set Performance ---
Accuracy: 0.9789
Precision: 0.9779
Recall: 0.9888
F1-score: 0.9833

--- Tuned Model Test Set Performance ---
Accuracy: 0.9720
Precision: 0.9670
Recall: 0.9888
F1-score: 0.9778

 

Reference solution

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

import pandas as pd 
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df["target"] = cancer.target
cancer_df

from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit(cancer.data)
input_scaled = ss.transform(cancer.data)

from sklearn.model_selection import train_test_split
train_input, test_input, train_target, test_target = \
train_test_split(input_scaled, cancer.target, random_state=21)

from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn.fit(train_input, train_target)
print("Training set score:", kn.score(train_input, train_target))
print("Test set score:", kn.score(test_input, test_target))

import numpy as np
indexes = np.arange(len(test_input))
np.random.shuffle(indexes)
print("Random predictions:", kn.predict(test_input[indexes[:5]]))
print("Actual targets:", test_target[indexes[:5]])

Logistic Regression

The approach so far: after training, give the model a data sample and it picks a class.

-> The model says: the probability of A is n%, the probability of B is n%

-> It reports the class with the highest probability

 

Data preparation

import pandas as pd

fish = pd.read_csv('https://bit.ly/fish_csv_data')
fish.head()


#    Species  Weight  Length  Diagonal   Height   Width
# 0  Bream     242.0    25.4      30.0  11.5200  4.0200
# 1  Bream     290.0    26.3      31.2  12.4800  4.3056
# 2  Bream     340.0    26.5      31.1  12.3778  4.6961
# 3  Bream     363.0    29.0      33.5  12.7300  4.4555
# 4  Bream     430.0    29.0      34.0  12.4440  5.1340

print(pd.unique(fish['Species']))
# ['Bream' 'Roach' 'Whitefish' 'Parkki' 'Perch' 'Pike' 'Smelt']

fish_input = fish[['Weight','Length','Diagonal','Height','Width']].to_numpy()

print(fish_input[:5])
# [[242.      25.4     30.      11.52     4.02  ]
#  [290.      26.3     31.2     12.48     4.3056]
#  [340.      26.5     31.1     12.3778   4.6961]
#  [363.      29.      33.5     12.73     4.4555]
#  [430.      29.      34.      12.444    5.134 ]]

fish_target = fish['Species'].to_numpy()

from sklearn.model_selection import train_test_split

train_input, test_input, train_target, test_target = train_test_split(
    fish_input, fish_target, random_state=42)
    
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(train_input)
train_scaled = ss.transform(train_input)
test_scaled = ss.transform(test_input)

 

Probability predictions with the k-nearest neighbors classifier

from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier(n_neighbors=3)
kn.fit(train_scaled, train_target)

print(kn.score(train_scaled, train_target))
print(kn.score(test_scaled, test_target))

# 0.8907563025210085
# 0.85

print(kn.classes_)
# ['Bream' 'Parkki' 'Perch' 'Pike' 'Roach' 'Smelt' 'Whitefish']

print(kn.predict(test_scaled[:5]))
# ['Perch' 'Smelt' 'Pike' 'Perch' 'Perch']

import numpy as np

proba = kn.predict_proba(test_scaled[:5])
print(np.round(proba, decimals=4))
# [[0.     0.     1.     0.     0.     0.     0.    ]
#  [0.     0.     0.     0.     0.     1.     0.    ]
#  [0.     0.     0.     1.     0.     0.     0.    ]
#  [0.     0.     0.6667 0.     0.3333 0.     0.    ]
#  [0.     0.     0.6667 0.     0.3333 0.     0.    ]]

distances, indexes = kn.kneighbors(test_scaled[3:4])
print(train_target[indexes])
# [['Roach' 'Perch' 'Perch']]

 

Logistic Regression

Despite the name "regression," it is a classification model.

It learns a linear equation, just like linear regression.

ex)

z = a*(Weight) + b*(Length) + c*(Diagonal) + d*(Height) + e*(Width) + f

-> a,b,c,d,e => weights (coefficients); z itself can be any number, but we want a probability between 0 and 1 (0~100%)

How do we get 0 when z is a very small negative number and 1 when z is a very large positive number? => use the sigmoid (logistic) function

Sigmoid: squashes the output of the linear equation into a value between 0 and 1; used for binary classification

 

import numpy as np
import matplotlib.pyplot as plt

z = np.arange(-5, 5, 0.1)
phi = 1 / (1 + np.exp(-z))

plt.plot(z, phi)
plt.xlabel('z')
plt.ylabel('phi')
plt.show()

When the probability is exactly 0.5 the behavior varies by library, but scikit-learn classifies it as the negative class.

 

Binary classification with logistic regression

Boolean indexing: a NumPy array can select rows when you pass it True/False values

char_arr = np.array(['A', 'B', 'C', 'D', 'E'])
print(char_arr[[True, False, True, False, False]])
# ['A' 'C']

bream_smelt_indexes = (train_target == 'Bream') | (train_target == 'Smelt')
train_bream_smelt = train_scaled[bream_smelt_indexes]
target_bream_smelt = train_target[bream_smelt_indexes]

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(train_bream_smelt, target_bream_smelt)

print(lr.predict(train_bream_smelt[:5]))
# ['Bream' 'Smelt' 'Bream' 'Bream' 'Bream']

print(lr.predict_proba(train_bream_smelt[:5]))
# [[0.99759855 0.00240145]
#  [0.02735183 0.97264817]
#  [0.99486072 0.00513928]
#  [0.98584202 0.01415798]
#  [0.99767269 0.00232731]]

print(lr.classes_)
# ['Bream' 'Smelt']

print(lr.coef_, lr.intercept_)
# [[-0.4037798  -0.57620209 -0.66280298 -1.01290277 -0.73168947]] [-2.16155132]

decisions = lr.decision_function(train_bream_smelt[:5])
print(decisions)
# [-6.02991358  3.57043428 -5.26630496 -4.24382314 -6.06135688]

from scipy.special import expit

print(expit(decisions))
# [0.00240145 0.97264817 0.00513928 0.01415798 0.00232731]

 

Multiclass classification with logistic regression

lr = LogisticRegression(C=20, max_iter=1000)
lr.fit(train_scaled, train_target)

print(lr.score(train_scaled, train_target))
print(lr.score(test_scaled, test_target))
# 0.9327731092436975
# 0.925

print(lr.predict(test_scaled[:5]))
# ['Perch' 'Smelt' 'Pike' 'Roach' 'Perch']

proba = lr.predict_proba(test_scaled[:5])
print(np.round(proba, decimals=3))
# [[0.    0.014 0.842 0.    0.135 0.007 0.003]
#  [0.    0.003 0.044 0.    0.007 0.946 0.   ]
#  [0.    0.    0.034 0.934 0.015 0.016 0.   ]
#  [0.011 0.034 0.305 0.006 0.567 0.    0.076]
#  [0.    0.    0.904 0.002 0.089 0.002 0.001]]

print(lr.classes_)
# ['Bream' 'Parkki' 'Perch' 'Pike' 'Roach' 'Smelt' 'Whitefish']

print(lr.coef_.shape, lr.intercept_.shape)
# (7, 5) (7,)

decision = lr.decision_function(test_scaled[:5])
print(np.round(decision, decimals=2))
# [[ -6.5    1.03   5.16  -2.73   3.34   0.33  -0.63]
#  [-10.86   1.93   4.77  -2.4    2.98   7.84  -4.26]
#  [ -4.34  -6.23   3.17   6.49   2.36   2.42  -3.87]
#  [ -0.68   0.45   2.65  -1.19   3.26  -5.75   1.26]
#  [ -6.4   -1.99   5.82  -0.11   3.5   -0.11  -0.71]]

from scipy.special import softmax

proba = softmax(decision, axis=1)
print(np.round(proba, decimals=3))
# [[0.    0.014 0.841 0.    0.136 0.007 0.003]
#  [0.    0.003 0.044 0.    0.007 0.946 0.   ]
#  [0.    0.    0.034 0.935 0.015 0.016 0.   ]
#  [0.011 0.034 0.306 0.007 0.567 0.    0.076]
#  [0.    0.    0.904 0.002 0.089 0.002 0.001]]

 

Softmax: in multiclass classification, normalizes the outputs of the several linear equations so that they sum to 1
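The softmax computation itself fits in a few lines (a sketch with made-up scores; subtracting the max before exponentiating is a standard trick for numerical stability):

```python
import numpy as np

def softmax(z):
    # Subtract the max first so np.exp never overflows
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])   # hypothetical decision_function outputs
probs = softmax(scores)
print(np.round(probs, 3))            # [0.659 0.242 0.099]
print(round(float(probs.sum()), 6))  # 1.0
```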

 

Supervised learning: show the model (problem -> answer) pairs, then pose similar problems for it to answer

 

Linear regression: show (features -> value) pairs so the model learns the trend, then pose new problems

 

Incremental learning

Instead of giving the model all the data at once, train it on the data a little at a time

 

Stochastic gradient descent

An algorithm that draws one sample at a time from the training set and follows the gradient of the loss function to find the best model
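A minimal sketch of a single stochastic gradient step for the logistic loss (all numbers made up; SGDClassifier does this internally, with extra machinery such as learning-rate schedules):

```python
import numpy as np

def sgd_step(w, b, x, y, lr=0.1):
    """One gradient step on a single sample; y is 0 or 1."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # sigmoid of the linear output
    grad = p - y                            # d(logistic loss)/dz
    return w - lr * grad * x, b - lr * grad

w, b = np.zeros(2), 0.0
x, y = np.array([1.0, 2.0]), 1
for _ in range(100):   # drawing the same sample over and over, for illustration
    w, b = sgd_step(w, b, x, y)

p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
print(round(float(p), 3))  # close to 1: the model now predicts the positive class
```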

 

Loss function

What stochastic gradient descent optimizes

-> the logistic loss function, a.k.a. binary cross-entropy loss

-> cross-entropy loss: the loss function used for multiclass classification

 

Epoch

One full pass through all the samples in stochastic gradient descent

 

Early stopping

Stopping training before overfitting sets in

 

Hinge loss

The loss function for the machine learning algorithm known as the support vector machine
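The hinge loss itself is easy to write down (a sketch; here the labels are ±1 and z plays the role of the decision_function output): it is zero once a sample is on the correct side of the margin, and grows linearly otherwise.

```python
import numpy as np

def hinge_loss(z, y):
    """y in {-1, +1}; z is the model's raw decision score."""
    return np.maximum(0.0, 1.0 - y * z)

z = np.array([2.0, 0.3, -1.0])   # made-up decision scores for positive samples
y = np.array([1, 1, 1])
print(hinge_loss(z, y))          # [0.  0.7 2. ]
```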

 

import pandas as pd

fish = pd.read_csv('https://bit.ly/fish_csv_data')

fish_input = fish[['Weight','Length','Diagonal','Height','Width']].to_numpy()
fish_target = fish['Species'].to_numpy()

from sklearn.model_selection import train_test_split

train_input, test_input, train_target, test_target = train_test_split(
    fish_input, fish_target, random_state=42)

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(train_input)
train_scaled = ss.transform(train_input)
test_scaled = ss.transform(test_input)

from sklearn.linear_model import SGDClassifier

sc = SGDClassifier(loss='log_loss', max_iter=10, random_state=42)
sc.fit(train_scaled, train_target)

print(sc.score(train_scaled, train_target))
print(sc.score(test_scaled, test_target))
# 0.773109243697479
# 0.775
# /usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_stochastic_gradient.py:702: ConvergenceWarning: Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.
#   warnings.warn(

sc.partial_fit(train_scaled, train_target)

print(sc.score(train_scaled, train_target))
print(sc.score(test_scaled, test_target))
# 0.8151260504201681
# 0.85

 

Epochs and overfitting/underfitting

import numpy as np

sc = SGDClassifier(loss='log_loss', random_state=42)

train_score = []
test_score = []

classes = np.unique(train_target)

for _ in range(0, 300):
    sc.partial_fit(train_scaled, train_target, classes=classes)

    train_score.append(sc.score(train_scaled, train_target))
    test_score.append(sc.score(test_scaled, test_target))
    
import matplotlib.pyplot as plt

plt.plot(train_score)
plt.plot(test_score)
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.show()

sc = SGDClassifier(loss='log_loss', max_iter=100, tol=None, random_state=42)
sc.fit(train_scaled, train_target)

print(sc.score(train_scaled, train_target))
print(sc.score(test_scaled, test_target))

# 0.957983193277311
# 0.925

sc = SGDClassifier(loss='hinge', max_iter=100, tol=None, random_state=42)
sc.fit(train_scaled, train_target)

print(sc.score(train_scaled, train_target))
print(sc.score(test_scaled, test_target))
# 0.9495798319327731
# 0.925

Second machine learning mission

 

iris

from sklearn.datasets import load_iris
iris = load_iris() # the iris multiclass dataset
import pandas as pd
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target
print(iris_df) # sepal and petal measurements
# target 0: setosa, target 1: versicolor, target 2: virginica
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

     target  
0         0  
1         0  
2         0  
3         0  
4         0  
..      ...  
145       2  
146       2  
147       2  
148       2  
149       2  

[150 rows x 5 columns]
# Split into training and test sets
from sklearn.model_selection import train_test_split

train_input, test_input, train_target, test_target = \
train_test_split(iris.data, iris.target, test_size=0.2, random_state=7)

print(train_input.shape)
print(test_input.shape)
(120, 4)
(30, 4)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(train_input, train_target)

print("Training set score:", model.score(train_input, train_target))
print("Test set score:", model.score(test_input, test_target))
Training set score: 0.9916666666666667
Test set score: 0.8666666666666667
# Overfitting
import numpy as np
indexes = np.arange(len(test_input))
np.random.shuffle(indexes)
print("Random predictions:", model.predict(test_input[indexes[:5]]))
print("Actual targets:", test_target[indexes[:5]])
Random predictions: [1 0 1 1 1]
Actual targets: [2 0 1 1 1]
# Hyperparameter tuning
c_list = [0.01, 0.1, 1, 10, 100]
train_score = []
test_score = []
for c in c_list:
  model = LogisticRegression(C=c, max_iter=1000)
  model.fit(train_input, train_target)
  train_score.append(model.score(train_input, train_target))
  test_score.append(model.score(test_input, test_target))
print(train_score)
print(test_score)
[0.8833333333333333, 0.9666666666666667, 0.9916666666666667, 0.9916666666666667, 0.9833333333333333]
[0.7666666666666667, 0.8333333333333334, 0.8666666666666667, 0.8666666666666667, 0.8666666666666667]
import matplotlib.pyplot as plt
plt.plot(np.log10(c_list), train_score)
plt.plot(np.log10(c_list), test_score)
plt.xlabel("C")
plt.ylabel("accuracy")  # classifier .score() returns accuracy, not R^2
plt.show()
