
2024 Bio Research Resources AI Competition Review, Part 2 (Code Write-up)

hoyo 2024. 10. 25. 09:55

In the end, 곽봉팔's XGBoost was our team's best model with an F1 score of 0.23, while my code topped out at 0.17.

 

미삭's preprocessed files were a big help in raising performance, so I'm posting only the final code.

 

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier
from tqdm import tqdm

# Load the preprocessed data and the submission template
train = pd.read_csv("train_h.csv")
test = pd.read_csv("test_h.csv")
submission = pd.read_csv("sample_submission.csv")

I replaced the original train.csv and test.csv with the _h files that 미삭 preprocessed.

 

# Integer-encode the target; inverse_transform restores the original labels at submission time
le_subclass = LabelEncoder()
train['SUBCLASS'] = le_subclass.fit_transform(train['SUBCLASS'])

X = train.drop(columns=['SUBCLASS'])
y = train['SUBCLASS']
categorical_columns = X.select_dtypes(include=['object', 'category']).columns

# Ordinal-encode the categorical features; categories unseen during fit map to -1
ordinal_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_encoded = X.copy()
X_encoded[categorical_columns] = ordinal_encoder.fit_transform(X[categorical_columns])

scaler = StandardScaler()
X_encoded = pd.DataFrame(scaler.fit_transform(X_encoded), columns=X_encoded.columns)

# Apply the same encoder and scaler (fit on train only) to the test set
X_test_encoded = test.copy()
X_test_encoded[categorical_columns] = ordinal_encoder.transform(test[categorical_columns])
X_test_encoded = pd.DataFrame(scaler.transform(X_test_encoded), columns=X_test_encoded.columns)

Thanks to that, the encoding step was painless.

 

# The four base models for the ensemble
log_clf = LogisticRegression(max_iter=1000)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
xgb_clf = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42, eval_metric='mlogloss')
lgbm_clf = LGBMClassifier(n_estimators=100, learning_rate=0.1, max_depth=6, random_state=42)

# Majority-vote ensemble over the four base models
voting_clf = VotingClassifier(estimators=[
    ('lr', log_clf), 
    ('rf', rf_clf), 
    ('xgb', xgb_clf), 
    ('lgbm', lgbm_clf)
], voting='hard')

param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [4, 6, 8],
}

# Tune the XGBoost model alone with a small grid (3-fold CV, accuracy)
grid_search = GridSearchCV(xgb_clf, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_encoded, y)

print(f"최적 하이퍼파라미터: {grid_search.best_params_}")
best_xgb_clf = grid_search.best_estimator_

I tried adding and dropping various models before settling on a hard-voting ensemble of these four. The best hyperparameters from the grid search were:

 {'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 100}
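
One caveat: the code above never wires best_xgb_clf back into the ensemble, so the voting classifier keeps the hand-set XGBoost parameters. If you want the cross-validation and final fit below to use the tuned model, a minimal sketch (my suggestion, not part of the original run) would be:

# Swap the tuned XGBoost into the ensemble; VotingClassifier lets you
# replace a named estimator via set_params. Suggested fix, not something
# the original submission did.
voting_clf.set_params(xgb=best_xgb_clf)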

 

# 5-fold cross-validation of the voting ensemble
kf = KFold(n_splits=5, shuffle=True, random_state=42)
accuracies = []

for fold, (train_idx, val_idx) in enumerate(tqdm(kf.split(X_encoded), total=kf.get_n_splits(), desc="K-Fold Progress")):
    X_train, X_val = X_encoded.iloc[train_idx], X_encoded.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    voting_clf.fit(X_train, y_train)
    
    y_pred = voting_clf.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    accuracies.append(accuracy)
    
    print(f"Fold {fold+1} Accuracy: {accuracy:.4f}")

print(f"Cross-Validation Scores: {accuracies}")
print(f"Mean Accuracy: {np.mean(accuracies):.4f}")

submission["SUBCLASS"] = le_subclass.inverse_transform(predictions)
submission.to_csv('submission.csv', encoding='UTF-8-sig', index=False)

After this simple k-fold run I submitted, finishing 303rd on the private leaderboard out of 1,681 participants.

 

Thanks to all the teammates who took part.