aydie-mllib 1.2.1
pip install aydie-mllib
aydie-mllib is a Python library designed to automate and simplify the process of training and tuning machine learning models. By leveraging a simple YAML configuration file, you can easily test multiple algorithms, perform hyperparameter tuning with `GridSearchCV`, and find the best model for your data without writing repetitive boilerplate code.
Features
- Configuration-Driven: Define your entire model training pipeline in a single YAML file.
- Automated Grid Search: Automatically performs hyperparameter tuning for multiple models.
- Model Agnostic: Works with any scikit-learn compatible model (e.g., `RandomForestRegressor`, `SVR`, `XGBClassifier`).
- Find the Best: Compares the tuned models and returns the one with the highest score.
- Easy to Use: Includes a helper function to generate a sample configuration file to get you started instantly.
Quickstart Guide
1. Generate the Configuration File
This script creates a sample `model_config.yaml` in a new `config` directory. This file acts as the blueprint for your training pipeline.
```python
from aydie_mllib.config import generate_sample_model_config

# Creates 'config/model_config.yaml'
file_path = generate_sample_model_config(export_dir="config")
print(f"Sample config generated at: {file_path}")
```
2. Customize `model_config.yaml`
Modify the YAML file to define the models and hyperparameter grids you want to test. Here, we set up a `RandomForestRegressor` and an `XGBRegressor`.
```yaml
grid_search:
  module: sklearn.model_selection
  class: GridSearchCV
  params:
    cv: 5
    verbose: 1

model_selection:
  module_0:
    module: sklearn.ensemble
    class: RandomForestRegressor
    params:
      random_state: 42
    search_param_grid:
      n_estimators:
        - 100
        - 200
      max_depth:
        - 5
        - 10

  module_1:
    module: xgboost
    class: XGBRegressor
    params:
      objective: reg:squarederror
    search_param_grid:
      n_estimators:
        - 50
        - 100
      learning_rate:
        - 0.05
        - 0.1
```
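Because the configuration file drives the whole pipeline, it can be worth confirming that your edits still parse before kicking off a training run. Below is a minimal sanity check using PyYAML (a separate dependency, not part of aydie-mllib's API) that simply lists the models the config declares:

```python
import yaml

# Parse the config and list every model it declares.
with open("config/model_config.yaml") as f:
    config = yaml.safe_load(f)

for key, spec in config["model_selection"].items():
    print(f"{key}: {spec['module']}.{spec['class']}")
# For the config above, this prints:
# module_0: sklearn.ensemble.RandomForestRegressor
# module_1: xgboost.XGBRegressor
```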
3. Find the Best Model
Finally, use the `ModelBuilder` to load your data and configuration, run the automated training and tuning process, and retrieve the best-performing model.
```python
import pandas as pd
from aydie_mllib import ModelBuilder

# --- 1. Load your data ---
# As an example, let's create some dummy data
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# --- 2. Initialize the ModelBuilder ---
# Point it to your configuration file
model_builder = ModelBuilder(model_config_path="config/model_config.yaml")

# --- 3. Get the best model ---
# The get_best_model method runs the entire pipeline
best_model_detail = model_builder.get_best_model(X=X, y=y, base_accuracy=0.6)

# --- 4. Print the results ---
print("\n--- Best Model Found ---")
print(f"Model Class: {best_model_detail.best_model.__class__.__name__}")
print(f"Best Score (R^2): {best_model_detail.best_score:.4f}")
print(f"Best Parameters: {best_model_detail.best_parameters}")
```
How it Works
The library is centered around the `ModelBuilder` class, which orchestrates the entire process based on your `model_config.yaml` file.
- `grid_search` section: Defines the hyperparameter search strategy. By default, it uses `sklearn.model_selection.GridSearchCV`. You can customize its parameters like `cv` (cross-validation folds).
- `model_selection` section: This is a dictionary where each key (e.g., `module_0`) represents a model to be evaluated.
  - `module`: The Python module where the model class is located (e.g., `sklearn.ensemble` or `xgboost`).
  - `class`: The name of the model class (e.g., `RandomForestRegressor`).
  - `params`: A dictionary of fixed parameters passed to the model's constructor.
  - `search_param_grid`: The dictionary of hyperparameters to be tuned by the grid search.
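Under the hood, this configuration pattern boils down to importing each model class by name and wrapping it in the configured search strategy. The library's actual internals may differ; the sketch below only illustrates how a single `model_selection` entry could be resolved with `importlib` and `GridSearchCV`:

```python
import importlib
from sklearn.model_selection import GridSearchCV

def build_search_for_entry(entry: dict, cv: int = 5) -> GridSearchCV:
    """Turn one model_selection entry into a ready-to-fit GridSearchCV."""
    module = importlib.import_module(entry["module"])   # e.g. sklearn.ensemble
    model_cls = getattr(module, entry["class"])         # e.g. RandomForestRegressor
    estimator = model_cls(**entry.get("params", {}))    # fixed constructor params
    return GridSearchCV(estimator, param_grid=entry["search_param_grid"], cv=cv)

# Example: the module_0 entry from the sample config above.
entry = {
    "module": "sklearn.ensemble",
    "class": "RandomForestRegressor",
    "params": {"random_state": 42},
    "search_param_grid": {"n_estimators": [100, 200], "max_depth": [5, 10]},
}
search = build_search_for_entry(entry)
# Calling search.fit(X, y) would tune and score this model.
```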