Applying Domain-Driven Design to Machine Learning Codebases

🧱 Project Structure Overview

This document outlines a domain-driven, modular folder structure for scalable, maintainable data pipeline and ML projects. It borrows concepts from Domain-Driven Design (DDD) and applies them to a typical data science project layout.


📂 Directory Structure

domains/
├── models/
│   ├── model_a/
│   │   ├── config.py
│   │   ├── datasets.py
│   │   ├── features.py
│   │   ├── model.py
│   │   └── aggregator.py
│   ├── model_b/
│   └── model_c/
│
├── engine/
│   ├── execution_logic/
│   └── decision_systems/
│
└── data/
    ├── data_source_a/
    ├── data_source_b/
    └── data_source_c/
        β”œβ”€β”€ config.py
        └── datasets.py

📦 shared/

shared/
├── config/
│   └── models/
│       └── model_config_schema.py
│
├── features/
│   └── static_attributes.py
│
├── transforms/
│   └── geo_transforms.py
│
├── utils/
│   └── ... (generic helpers: math, time, validation, etc.)
│
├── data_loader.py
└── config_builder.py

shared/ contains truly cross-domain code: reusable components, shared schemas, and general-purpose logic.
If a module has domain-specific meaning, it belongs in the appropriate domains/ folder instead.
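
As a rough illustration of this split, the sketch below puts a generic great-circle distance helper where shared/transforms/geo_transforms.py might hold it, and a model-specific feature that uses it where domains/models/model_a/features.py might hold it. The function names (haversine_km, add_distance_feature), the feature name, and the row layout are hypothetical and chosen only for this example; both parts are shown in one block so that it runs as-is.

```python
import math

# --- could live in shared/transforms/geo_transforms.py (hypothetical contents) ---
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to guard against floating-point values slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# --- could live in domains/models/model_a/features.py (hypothetical contents) ---
def add_distance_feature(rows, ref_lat, ref_lon):
    """Return copies of the rows with a model_a-specific distance feature added."""
    return [
        {**row, "distance_to_ref_km": haversine_km(row["lat"], row["lon"], ref_lat, ref_lon)}
        for row in rows
    ]
```

The distance formula carries no business meaning, so it can sit in shared/. Which reference point matters, and what the resulting feature is called, is a modelling decision, so that part stays with the domain.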


⚙️ tasks/

tasks/
└── ... (entry-point task functions for orchestration)

Top-level pipeline functions. These may be executed via orchestration platforms (e.g. Airflow, Databricks), a CLI, or other external triggers.
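
A minimal sketch of what such an entry point might look like is shown below. The three helper functions are stand-ins for code that would normally be imported from domains/; their names, the task name train_model_a_task, and the config keys are all hypothetical.

```python
# Stand-ins for domain code. In the layout above these would be imported from
# domains/data/... and domains/models/model_a/... (all names are hypothetical).
def load_dataset(config):
    """Pretend data-domain loader: returns the rows listed in the config."""
    return list(config["rows"])


def build_features(rows):
    """Pretend model-domain feature step: extracts one numeric column."""
    return [float(row["value"]) for row in rows]


def fit_model(features):
    """Pretend model-domain training step: a mean predictor."""
    return {"mean": sum(features) / len(features)}


# --- could live in tasks/ (hypothetical task name) ---
def train_model_a_task(config):
    """Entry point for an orchestrator or CLI: wiring only, no business logic."""
    rows = load_dataset(config)
    features = build_features(rows)
    return fit_model(features)


result = train_model_a_task({"rows": [{"value": 1}, {"value": 3}]})
```

The task only sequences calls. Anything that looks like a transformation or a modelling decision belongs in the domain functions it calls, which keeps the task easy to trigger from different platforms.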


📜 configs/

configs/
├── shared/
│   └── test_config.yml
│
└── domain/
    ├── models/
    │   └── model_a/
    │       └── config.yml
    │
    ├── engine/
    │   └── decision_systems/
    │       └── config.yml
    │
    └── data/
        └── ... (YAML configs for data processing jobs)

Configuration files are stored separately from code. They are typically parsed by domain config classes and contain things like dataset paths, model parameters, and job settings.
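
One possible shape for such a domain config class is sketched below. The class name ModelAConfig, its fields, and the example values are assumptions made for illustration; the source does not specify them. To keep the block dependency-free it parses an already-loaded mapping; in practice that mapping would typically come from reading the YAML file, for example with PyYAML's yaml.safe_load.

```python
from dataclasses import dataclass
from typing import Any, Mapping


# --- could live in domains/models/model_a/config.py (hypothetical fields) ---
@dataclass(frozen=True)
class ModelAConfig:
    dataset_path: str
    learning_rate: float
    max_epochs: int

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "ModelAConfig":
        """Validate and convert a parsed config mapping into a typed object."""
        required = {"dataset_path", "learning_rate", "max_epochs"}
        missing = required - set(raw)
        if missing:
            raise ValueError(f"missing config keys: {sorted(missing)}")
        return cls(
            dataset_path=str(raw["dataset_path"]),
            learning_rate=float(raw["learning_rate"]),
            max_epochs=int(raw["max_epochs"]),
        )


# Assumption: `raw` stands in for the result of loading
# configs/domain/models/model_a/config.yml (e.g. via yaml.safe_load).
raw = {"dataset_path": "data/train.parquet", "learning_rate": "0.01", "max_epochs": 20}
config = ModelAConfig.from_mapping(raw)
```

Failing early on missing keys means a misconfigured job stops at startup rather than partway through a pipeline run.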


🧠 Design Principles

Principle             Description
Domain-Driven         Group logic by business/domain responsibility, not technical layer
Cohesion Over Reuse   Keep related logic together; avoid premature abstraction
Shared for Stability  Use shared/ only for stable, cross-domain components
Config is Composed    Domain config models can use shared schemas to reduce duplication
Extensibility First   Domains can extend or override shared logic when needed
Tasks Are Thin        Orchestration functions should assemble domain logic, not implement it

✅ Best Practices

  • Keep domain-specific components self-contained
  • Use shared config schemas (e.g., model hyperparameters) to avoid duplication
  • Avoid coupling domains through shared logic unless it is stable and intentional
  • Compose and extend rather than duplicate logic when variations are needed
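
The practices around shared config schemas and composing or extending can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the class names, fields, and default values are hypothetical, and the comments only suggest where each piece could live in the layout above.

```python
from dataclasses import dataclass, field


# --- could live in shared/config/models/model_config_schema.py (hypothetical fields) ---
@dataclass(frozen=True)
class ModelHyperparameters:
    learning_rate: float = 0.1
    max_depth: int = 6


# --- could live in domains/models/model_a/config.py ---
@dataclass(frozen=True)
class ModelAConfig:
    dataset_path: str
    # Composition: reuse the shared schema as-is instead of redefining its fields.
    hyperparameters: ModelHyperparameters = field(default_factory=ModelHyperparameters)


# --- could live in domains/models/model_b/config.py ---
@dataclass(frozen=True)
class ModelBHyperparameters(ModelHyperparameters):
    # Extension: one extra knob that only model_b needs.
    dropout: float = 0.0


@dataclass(frozen=True)
class ModelBConfig:
    dataset_path: str
    hyperparameters: ModelBHyperparameters = field(default_factory=ModelBHyperparameters)


a = ModelAConfig(dataset_path="a.parquet")
b = ModelBConfig(
    dataset_path="b.parquet",
    hyperparameters=ModelBHyperparameters(max_depth=3, dropout=0.2),
)
```

Here model_a uses the shared schema unchanged, while model_b adds a field in its own domain folder. The shared schema stays small and stable, and neither domain depends on the other.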