aydie-mllib 1.2.1
pip install aydie-mllib
aydie-mllib is a Python library designed to automate and simplify the process of training and tuning machine learning models. By leveraging a simple YAML configuration file, you can easily test multiple algorithms, perform hyperparameter tuning with `GridSearchCV`, and find the best model for your data without writing repetitive boilerplate code.
Features
- Configuration-Driven: Define your entire model training pipeline in a single YAML file.
- Automated Grid Search: Automatically performs hyperparameter tuning for multiple models.
- Model Agnostic: Works with any scikit-learn compatible model (e.g., `RandomForestRegressor`, `SVR`, `XGBClassifier`).
- Find the Best: Compares the tuned models and returns the one with the highest score.
- Easy to Use: Includes a helper function to generate a sample configuration file to get you started instantly.
Quickstart Guide
1. Generate the Configuration File
This script creates a sample `model_config.yaml` in a new `config` directory. This file acts as the blueprint for your training pipeline.
```python
from aydie_mllib.config import generate_sample_model_config

# Creates 'config/model_config.yaml'
file_path = generate_sample_model_config(export_dir="config")
print(f"Sample config generated at: {file_path}")
```
2. Customize `model_config.yaml`
Modify the YAML file to define the models and hyperparameter grids you want to test. Here, we set up a `RandomForestRegressor` and an `XGBRegressor`.
```yaml
grid_search:
  module: sklearn.model_selection
  class: GridSearchCV
  params:
    cv: 5
    verbose: 1

model_selection:
  module_0:
    module: sklearn.ensemble
    class: RandomForestRegressor
    params:
      random_state: 42
    search_param_grid:
      n_estimators:
        - 100
        - 200
      max_depth:
        - 5
        - 10

  module_1:
    module: xgboost
    class: XGBRegressor
    params:
      objective: reg:squarederror
    search_param_grid:
      n_estimators:
        - 50
        - 100
      learning_rate:
        - 0.05
        - 0.1
```
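Because the configuration file drives the whole pipeline, it can be worth confirming that your edits still parse before kicking off a training run. Below is a minimal sanity check using PyYAML (a separate dependency, not part of aydie-mllib's API) that simply lists the models the config declares:

```python
import yaml

# Parse the config and list every model it declares.
with open("config/model_config.yaml") as f:
    config = yaml.safe_load(f)

for key, spec in config["model_selection"].items():
    print(f"{key}: {spec['module']}.{spec['class']}")
# For the config above, this prints:
# module_0: sklearn.ensemble.RandomForestRegressor
# module_1: xgboost.XGBRegressor
```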
3. Find the Best Model
Finally, use the `ModelBuilder` to load your data and configuration, run the automated training and tuning process, and retrieve the best-performing model.
```python
import pandas as pd
from aydie_mllib import ModelBuilder

# --- 1. Load your data ---
# As an example, let's create some dummy data
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

# --- 2. Initialize the ModelBuilder ---
# Point it to your configuration file
model_builder = ModelBuilder(model_config_path="config/model_config.yaml")

# --- 3. Get the best model ---
# The get_best_model method runs the entire pipeline
best_model_detail = model_builder.get_best_model(X=X, y=y, base_accuracy=0.6)

# --- 4. Print the results ---
print("\n--- Best Model Found ---")
print(f"Model Class: {best_model_detail.best_model.__class__.__name__}")
print(f"Best Score (R^2): {best_model_detail.best_score:.4f}")
print(f"Best Parameters: {best_model_detail.best_parameters}")
```
How it Works
The library is centered around the `ModelBuilder` class, which orchestrates the entire process based on your `model_config.yaml` file.
- `grid_search` section: Defines the hyperparameter search strategy. By default, it uses `sklearn.model_selection.GridSearchCV`. You can customize its parameters like `cv` (cross-validation folds).
- `model_selection` section: This is a dictionary where each key (e.g., `module_0`) represents a model to be evaluated.
  - `module`: The Python module where the model class is located (e.g., `sklearn.ensemble` or `xgboost`).
  - `class`: The name of the model class (e.g., `RandomForestRegressor`).
  - `params`: A dictionary of fixed parameters passed to the model's constructor.
  - `search_param_grid`: The dictionary of hyperparameters to be tuned by the grid search.
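Under the hood, this configuration pattern boils down to importing each model class by name and wrapping it in the configured search strategy. The library's actual internals may differ; the sketch below only illustrates how a single `model_selection` entry could be resolved with `importlib` and `GridSearchCV`:

```python
import importlib
from sklearn.model_selection import GridSearchCV

def build_search_for_entry(entry: dict, cv: int = 5) -> GridSearchCV:
    """Turn one model_selection entry into a ready-to-fit GridSearchCV."""
    module = importlib.import_module(entry["module"])   # e.g. sklearn.ensemble
    model_cls = getattr(module, entry["class"])         # e.g. RandomForestRegressor
    estimator = model_cls(**entry.get("params", {}))    # fixed constructor params
    return GridSearchCV(estimator, param_grid=entry["search_param_grid"], cv=cv)

# Example: the module_0 entry from the sample config above.
entry = {
    "module": "sklearn.ensemble",
    "class": "RandomForestRegressor",
    "params": {"random_state": 42},
    "search_param_grid": {"n_estimators": [100, 200], "max_depth": [5, 10]},
}
search = build_search_for_entry(entry)
# Calling search.fit(X, y) would tune and score this model.
```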