## Overview

Most data science projects begin as experimental scripts. While this works for quick prototyping, it often leads to a tangled mess of hard-coded paths, hidden dependencies, and bloated main functions. This tutorial focuses on the final stage of refactoring an MNIST digit recognition project. We aim to move logic into dedicated runner classes, centralize configuration constants, and decouple data loading from the core model logic. By doing this, we make the project adaptable to new datasets without hunting through dozens of source files.

## Prerequisites

To follow this walkthrough, you should have a solid grasp of Python fundamentals, including classes and functions. Familiarity with PyTorch concepts such as `DataLoader`s and training loops is highly recommended, as is basic knowledge of scikit-learn.

## Key Libraries & Tools

* Python 3.x: The primary language for the implementation.
* pathlib: The standard-library module for object-oriented filesystem paths.
* PyTorch: The deep learning framework used for the model and training logic.

## Code Walkthrough: Cleaning the Main Loop

The first step is extracting the training loop from `main.py` into a specialized function within `running.py`. This reduces the complexity of the entry point.

```python
def run_epoch(test_runner, train_runner, tracker, epoch_id, epoch_total):
    # Core epoch logic moved out of main.py
    train_runner.run(epoch_id)
    test_runner.run(epoch_id)
    tracker.log_epoch(epoch_id, epoch_total)
```

By passing the runners as arguments, we remove the need for `main.py` to manage internal state. We also replace the non-standard `hparams` dictionary with clear, uppercase constants at the top of the file. This makes **hyperparameters** like `BATCH_SIZE` and `LEARNING_RATE` immediately visible and editable.

## Decoupling Data Loading

A common mistake is hard-coding file paths inside data-loading functions. We refactor this by injecting the path from the configuration level.
```python
def create_data_loader(batch_size, data_path: Path, label_path: Path, shuffle=True):
    images = load_image_data(data_path)
    labels = load_label_data(label_path)
    dataset = MNISTDataset(images, labels)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
```

This approach ensures that `MNISTDataset` doesn't need to know where the files live. It simply receives the data it needs to operate.

## Syntax Notes & Best Practices

Using **type hints** is non-negotiable for professional Python code. By declaring types like `: Path` or `: int`, you allow your IDE and static checkers to catch bugs before execution. We also follow the **Information Expert** principle: assign a responsibility to the class that has the most information needed to fulfill it. The `Runner` class handles metrics because it sees the raw data during the training loop.

## Tips & Gotchas

Avoid "shotgun surgery," where one change requires edits in five different files. If you change your data directory, you should only have to update one constant in `main.py`. Also, keep validation and loading separate. Don't put `assert` statements about data shape inside your loading function; move them to a dedicated validation suite or unit test.
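To make the last tip concrete, here is a minimal sketch of a shape-validation test. The `load_image_data` stub and the path string are placeholders for illustration; the project's real loader is not shown in the article.

```python
# Sketch: shape checks live in a test, not in the loading function.
import numpy as np

def load_image_data(path):
    # Stand-in for the project's real loader; returns dummy
    # MNIST-shaped data so the test below is runnable.
    return np.zeros((64, 28, 28), dtype=np.uint8)

def test_image_shape():
    images = load_image_data("data/train-images")
    # Validation belongs here: MNIST images are 28x28 grayscale.
    assert images.ndim == 3
    assert images.shape[1:] == (28, 28)
```

A test runner like pytest will pick up `test_image_shape` automatically, keeping the loader itself free of assertions.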
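To illustrate the "one constant to update" rule, the configuration block at the top of `main.py` might look like the sketch below. The values and file names are illustrative, not taken from the project.

```python
from pathlib import Path

# Hyperparameters, visible and editable at the top of main.py
# (values are illustrative).
BATCH_SIZE = 128
LEARNING_RATE = 1e-3

# Single source of truth for data locations: pointing the project
# at a new dataset means changing DATA_DIR and nothing else.
DATA_DIR = Path("data")
TRAIN_IMAGES = DATA_DIR / "train-images-idx3-ubyte"
TRAIN_LABELS = DATA_DIR / "train-labels-idx1-ubyte"
```

These constants are then passed down into `create_data_loader`, so no loading code ever hard-codes a path.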