Project Structure Overview
This document outlines a domain-driven, modular folder structure designed for scalable, maintainable data pipeline and ML projects. It borrows concepts from Domain-Driven Design (DDD) and applies them to a typical data science project layout.
Directory Structure
domains/
├── models/
│   ├── model_a/
│   │   ├── config.py
│   │   ├── datasets.py
│   │   ├── features.py
│   │   ├── model.py
│   │   └── aggregator.py
│   ├── model_b/
│   └── model_c/
│
├── engine/
│   ├── execution_logic/
│   └── decision_systems/
│
└── data/
    ├── data_source_a/
    ├── data_source_b/
    └── data_source_c/
        ├── config.py
        └── datasets.py
shared/
shared/
├── config/
│   └── models/
│       └── model_config_schema.py
│
├── features/
│   └── static_attributes.py
│
├── transforms/
│   └── geo_transforms.py
│
├── utils/
│   └── ... (generic helpers: math, time, validation, etc.)
│
├── data_loader.py
└── config_builder.py
The shared/ folder contains truly cross-domain code: reusable components, shared schemas, and general-purpose logic. If a module has domain-specific meaning, it belongs in the appropriate domains/ folder instead.
tasks/
tasks/
└── ... (entry-point task functions for orchestration)
This folder holds the top-level pipeline functions. These may be executed via orchestration platforms (e.g. Airflow, Databricks), a CLI, or other external triggers.
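As a rough illustration of what such an entry point might look like, the sketch below wires three domain steps together without implementing any of them. Every name in it (`ModelAConfig`, `load_dataset`, `build_features`, `train_model`, `train_model_a_task`) is hypothetical, and the domain functions are toy stand-ins for code that would normally live under domains/models/model_a/.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical stand-ins for code that would live under domains/models/model_a/.
@dataclass
class ModelAConfig:
    dataset_path: str
    learning_rate: float


def load_dataset(config: ModelAConfig) -> list[float]:
    """Stand-in for datasets.py; a real version would read config.dataset_path."""
    return [1.0, 2.0, 3.0]


def build_features(rows: list[float]) -> list[float]:
    """Stand-in for features.py."""
    return [value * 2 for value in rows]


def train_model(features: list[float], config: ModelAConfig) -> float:
    """Stand-in for model.py; returns a toy number instead of a fitted model."""
    return config.learning_rate * sum(features)


def train_model_a_task(config: ModelAConfig) -> float:
    """Entry point for tasks/: assembles domain steps and holds no logic itself."""
    rows = load_dataset(config)
    features = build_features(rows)
    return train_model(features, config)
```

An orchestrator or CLI wrapper would then only need to build the config and call `train_model_a_task(config)`, which keeps the task easy to swap between triggers.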
configs/
configs/
├── shared/
│   └── test_config.yml
│
└── domain/
    ├── models/
    │   └── model_a/
    │       └── config.yml
    │
    ├── engine/
    │   └── decision_systems/
    │       └── config.yml
    │
    └── data/
        └── ... (YAML configs for data processing jobs)
Configuration files are stored separately from code. They are typically parsed by domain config classes and contain things like dataset paths, model parameters, and job settings.
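One possible shape for a domain config class is sketched below. It assumes the YAML file has already been parsed into a plain dictionary (for example with PyYAML's `yaml.safe_load`), so the example itself needs only the standard library. The class names, field names, and config keys are all hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ModelHyperparameters:
    """Hypothetical shared schema, e.g. in shared/config/models/model_config_schema.py."""
    learning_rate: float
    n_estimators: int


@dataclass(frozen=True)
class ModelAConfig:
    """Hypothetical domain config, e.g. in domains/models/model_a/config.py."""
    dataset_path: str
    hyperparameters: ModelHyperparameters

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "ModelAConfig":
        # Compose the shared schema into the domain config instead of redefining it.
        return cls(
            dataset_path=raw["dataset_path"],
            hyperparameters=ModelHyperparameters(**raw["hyperparameters"]),
        )


# Stand-in for the parsed contents of configs/domain/models/model_a/config.yml.
raw_config = {
    "dataset_path": "data/model_a/train.parquet",
    "hyperparameters": {"learning_rate": 0.1, "n_estimators": 200},
}

config = ModelAConfig.from_dict(raw_config)
```

Because the hyperparameter schema is defined once and reused, a second model's config class could embed the same `ModelHyperparameters` type rather than repeating its fields.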
Design Principles
Principle | Description |
---|---|
Domain-Driven | Group logic by business/domain responsibility, not technical layer |
Cohesion Over Reuse | Keep related logic together; avoid premature abstraction |
Shared for Stability | Use shared/ only for stable, cross-domain components |
Config is Composed | Domain config models can use shared schemas to reduce duplication |
Extensibility First | Domains can extend or override shared logic when needed |
Tasks Are Thin | Orchestration functions should assemble domain logic, not implement it |
Best Practices
- Keep domain-specific components self-contained
- Use shared config schemas (e.g., model hyperparameters) to avoid duplication
- Avoid coupling domains through shared logic unless it is stable and intentional
- Compose and extend rather than duplicate logic when variations are needed
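The last practice can be sketched with a small example in which a domain extends a shared component instead of copying it. The classes and column names below are hypothetical; the actual contents of shared/features/static_attributes.py are not described in this document.

```python
from __future__ import annotations


class StaticAttributes:
    """Hypothetical shared component, e.g. in shared/features/static_attributes.py."""

    def columns(self) -> list[str]:
        return ["id", "created_at"]


class ModelAStaticAttributes(StaticAttributes):
    """Hypothetical domain variant, e.g. in domains/models/model_a/features.py."""

    def columns(self) -> list[str]:
        # Extend the shared definition rather than duplicating its column list.
        return super().columns() + ["region"]
```

Here the domain class adds one column while still picking up any later change to the shared list, which is the trade-off this practice aims for.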