Ensemble methods for improved decision accuracy and reduced model variance, using Python and scikit-learn.
Random Forest is an ensemble learning method that constructs multiple decision trees during training and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. This approach significantly improves accuracy compared to single decision trees.
[Figure: a single decision tree compared with a random forest ensemble of many trees]
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('data.csv')
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Initialize random forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    min_samples_split=2,
    n_jobs=-1
)

# Train model
rf.fit(X_train, y_train)

# Get feature importances and plot the top 10
features = pd.Series(rf.feature_importances_, index=X.columns)
features.nlargest(10).plot(kind='barh')
plt.show()
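To make the "mode of the classes" idea concrete, the sketch below compares the forest's predictions against a manual aggregation of its individual trees. Note that scikit-learn actually averages the trees' class probabilities rather than taking a strict majority vote, which usually yields the same answer. The make_classification dataset here is illustrative, standing in for data.csv above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Average the per-tree class probabilities, then take the most likely class;
# this mirrors how scikit-learn aggregates its trees internally
avg_proba = np.mean([tree.predict_proba(X_test) for tree in rf.estimators_], axis=0)
manual_pred = rf.classes_[np.argmax(avg_proba, axis=1)]

print(np.array_equal(manual_pred, rf.predict(X_test)))  # True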
Example hyperparameter configuration:
n_estimators: 100
max_depth: 5
min_samples_leaf: 5
max_features: 'log2'
criterion: 'entropy'
class_weight: 'balanced'
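Expressed in code, that configuration looks like the sketch below (random_state is added here for reproducibility and is not part of the listed settings):

from sklearn.ensemble import RandomForestClassifier

# The configuration listed above, as constructor arguments
rf = RandomForestClassifier(
    n_estimators=100,          # number of trees in the forest
    max_depth=5,               # cap tree depth to limit overfitting
    min_samples_leaf=5,        # require at least 5 samples per leaf
    max_features='log2',       # features considered at each split
    criterion='entropy',       # information-gain split criterion
    class_weight='balanced',   # reweight classes by inverse frequency
    random_state=42,
)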
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def optimize_random_forest(X, y):
    # Create parameter grid
    param_grid = {
        'n_estimators': [50, 100, 150],
        'max_depth': [3, 5, 7],
        'min_samples_split': [2, 5, 10],
        'max_features': ['sqrt', 'log2']
    }

    # Initialize grid search with 5-fold CV over the grid
    rf = RandomForestRegressor(random_state=42)
    grid_search = GridSearchCV(rf, param_grid, cv=5, n_jobs=-1,
                               scoring='neg_mean_squared_error')
    grid_search.fit(X, y)

    return {
        'best_model': grid_search.best_estimator_,
        'params': grid_search.best_params_,
        'score': -grid_search.best_score_  # flip the sign back to MSE
    }
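A quick usage sketch, with make_regression standing in for a real dataset:

from sklearn.datasets import make_regression

# Illustrative data; substitute your own X and y
X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=42)

result = optimize_random_forest(X, y)
print(result['params'])  # the best parameter combination found
print(result['score'])   # the corresponding cross-validated MSE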
Useful options when training forests:
- n_jobs=-1 for parallel processing
- warm_start=True for iterative training (add trees to an already-fitted model)
- oob_score=True for out-of-bag evaluation (the result is exposed as the fitted model's oob_score_ attribute)

Use k-fold cross-validation (typically 5-7 folds) with stratified sampling to prevent overfitting and ensure model reliability; a minimal sketch follows.
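A minimal sketch of stratified 5-fold cross-validation (the dataset and fold count are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Accuracy on each held-out fold; a small spread suggests a stable model
scores = cross_val_score(rf, X, y, cv=cv, scoring='accuracy')
print(scores.mean(), scores.std())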
Remove low-importance features (bottom 10-20%) to reduce model complexity and speed up training while maintaining prediction accuracy; one way to do this is shown below.
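A sketch using SelectFromModel to drop the least important features. The 15th-percentile cutoff is an assumption chosen to match the 10-20% guideline above:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Keep features whose importance exceeds the 15th percentile,
# i.e. drop roughly the bottom 15%
threshold = np.percentile(rf.feature_importances_, 15)
selector = SelectFromModel(rf, threshold=threshold, prefit=True)
X_reduced = selector.transform(X)
print(X.shape, '->', X_reduced.shape)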
Utilize multi-core processing via the n_jobs=-1 parameter to speed up model training and hyperparameter tuning.
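A rough timing sketch; the exact speedup depends on your CPU and data size:

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Compare single-core training with all-cores training
for n_jobs in (1, -1):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=200, n_jobs=n_jobs,
                           random_state=42).fit(X, y)
    print(f'n_jobs={n_jobs}: {time.perf_counter() - start:.2f}s')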
Feature Importance Ranking:
1. Age - 0.283
2. Income - 0.234
3. Occupation - 0.187
4. Location - 0.156
5. Education - 0.089
6. Debt - 0.078
7. History - 0.032
8. Credit Score - 0.026
9. Loans - 0.015
10. Employment - 0.010
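A ranking like this can be printed from any fitted forest; a sketch assuming the rf and X from the first example:

import pandas as pd

# Sort importances in descending order and print a numbered ranking
features = pd.Series(rf.feature_importances_, index=X.columns)
for rank, (name, score) in enumerate(features.sort_values(ascending=False).items(), 1):
    print(f'{rank}. {name} - {score:.3f}')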