Refactoring Data Science Projects: Finalizing Configuration and Data Flow
Overview
Most data science projects begin as experimental scripts. While this works for quick prototyping, it often leads to a tangled mess of hard-coded paths, hidden dependencies, and bloated main functions. This tutorial focuses on the final stage of refactoring a
Prerequisites
To follow this walkthrough, a solid grasp of DataLoaders and training loops is highly recommended, as well as basic knowledge of
Key Libraries & Tools
- Python: The primary language for the implementation.
- Pathlib: A modern Python standard-library module for object-oriented filesystem paths.
- PyTorch: The deep learning framework used for the model and training logic.
Code Walkthrough: Cleaning the Main Loop
The first step involves extracting the training loop from main.py into a specialized function within running.py. This reduces the complexity of the entry point.
    def run_epoch(test_runner, train_runner, tracker, epoch_id, epoch_total):
        # Core epoch logic moved here
        train_runner.run(epoch_id)
        test_runner.run(epoch_id)
        tracker.log_epoch(epoch_id, epoch_total)
By passing the runners as arguments, we remove the need for main.py to manage internal state. We also replace the non-standard hparams dictionary with clear, uppercase constants at the top of the file. This makes hyperparameters such as BATCH_SIZE and LEARNING_RATE immediately visible and editable.
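A minimal sketch of how the entry point might look after this change, assuming the run_epoch signature shown above. The StubRunner and StubTracker classes, the EPOCH_COUNT constant, and all constant values are hypothetical stand-ins for illustration, not the project's real implementations.

```python
from dataclasses import dataclass, field

# Hyperparameters as module-level constants (values are illustrative only).
EPOCH_COUNT = 3
BATCH_SIZE = 128
LEARNING_RATE = 5e-5


@dataclass
class StubRunner:
    """Hypothetical stand-in for a runner; records which epochs it ran."""
    name: str
    epochs_seen: list[int] = field(default_factory=list)

    def run(self, epoch_id: int) -> None:
        self.epochs_seen.append(epoch_id)


@dataclass
class StubTracker:
    """Hypothetical stand-in for the experiment tracker."""
    logged: list[tuple[int, int]] = field(default_factory=list)

    def log_epoch(self, epoch_id: int, epoch_total: int) -> None:
        self.logged.append((epoch_id, epoch_total))


def run_epoch(test_runner, train_runner, tracker, epoch_id, epoch_total):
    # Same body as the function extracted into running.py.
    train_runner.run(epoch_id)
    test_runner.run(epoch_id)
    tracker.log_epoch(epoch_id, epoch_total)


def main() -> StubTracker:
    # main only wires objects together; it holds no epoch-level state itself.
    train_runner = StubRunner("train")
    test_runner = StubRunner("test")
    tracker = StubTracker()
    for epoch_id in range(EPOCH_COUNT):
        run_epoch(test_runner, train_runner, tracker, epoch_id, EPOCH_COUNT)
    return tracker


if __name__ == "__main__":
    print(main().logged)
```

With the constants at the top, changing the number of epochs is a one-line edit that never touches the loop body.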
Decoupling Data Loading
A common mistake is hard-coding file paths inside data-loading functions. We refactor this by injecting the path from the configuration level.
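    from pathlib import Path
    from torch.utils.data import DataLoader

    # load_image_data, load_label_data, and MNISTDataset are the project's own helpers.
    def create_data_loader(batch_size: int, data_path: Path, label_path: Path, shuffle: bool = True) -> DataLoader:
        images = load_image_data(data_path)
        labels = load_label_data(label_path)
        dataset = MNISTDataset(images, labels)
        return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)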
This approach ensures that MNISTDataset doesn't need to know where the files live. It simply receives the data it needs to operate.
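One practical benefit of injecting paths is that the same function can run against temporary files in a test. The sketch below illustrates this with the standard library only. The load_lines and create_pairs helpers and the file names are hypothetical and merely mirror the shape of create_data_loader; they are not part of the original project.

```python
import tempfile
from pathlib import Path


def load_lines(path: Path) -> list[str]:
    """Hypothetical loader: one record per line of a text file."""
    return path.read_text().splitlines()


def create_pairs(data_path: Path, label_path: Path) -> list[tuple[str, str]]:
    """Pair records with labels; both paths are supplied by the caller."""
    records = load_lines(data_path)
    labels = load_lines(label_path)
    return list(zip(records, labels))


def demo() -> list[tuple[str, str]]:
    # Nothing is hard-coded, so the function works on throwaway files.
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)  # plays the role of a data-directory constant
        data_path = data_dir / "records.txt"
        label_path = data_dir / "labels.txt"
        data_path.write_text("img0\nimg1\n")
        label_path.write_text("7\n3\n")
        return create_pairs(data_path, label_path)


if __name__ == "__main__":
    print(demo())
```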
Syntax Notes & Best Practices
Type hints are non-negotiable for professional Python code. By annotating parameters with : Path or : int, you allow your IDE to catch bugs before execution. We also follow the Information Expert principle: assign each responsibility to the class that has the most information to fulfill it. The Runner class handles metrics because it sees the raw data during the training loop.
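To make both ideas concrete, here is a small framework-free sketch. The Runner below is a hypothetical simplification, not the project's actual class: it is type-hinted throughout and owns the accuracy metric because it is the object that sees predictions and targets for every batch.

```python
from dataclasses import dataclass


@dataclass
class Runner:
    """Hypothetical runner that keeps its own running accuracy."""
    correct: int = 0
    seen: int = 0

    def record_batch(self, predictions: list[int], targets: list[int]) -> None:
        # The runner sees the raw predictions and targets, so it updates the counts.
        self.correct += sum(p == t for p, t in zip(predictions, targets))
        self.seen += len(targets)

    @property
    def accuracy(self) -> float:
        # Guard against division by zero before any batch has been recorded.
        return self.correct / self.seen if self.seen else 0.0
```

With the annotations in place, a type checker or IDE can flag a call such as record_batch("abc", 3) before the code ever runs.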
Tips & Gotchas
Avoid "shotgun surgery," where one change requires edits in five different files. If you change your data directory, you should only have to update one constant in main.py. Also, keep validation and loading separate. Don't put assert statements regarding data shape inside your loading function; move those to a dedicated validation suite or unit test.
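The separation of loading and validation can be sketched as follows. This is an illustrative example using plain lists rather than tensors; load_rows, has_row_width, and the test class are hypothetical names, and a real project would check tensor shapes instead.

```python
import unittest


def load_rows(raw: str) -> list[list[int]]:
    """Loading only: parse comma-separated integers, one row per line."""
    return [[int(value) for value in line.split(",")] for line in raw.splitlines()]


def has_row_width(rows: list[list[int]], width: int) -> bool:
    """Validation only: True if every row has exactly `width` values."""
    return all(len(row) == width for row in rows)


class RowShapeTest(unittest.TestCase):
    """Shape checks live in the test suite, not inside the loader."""

    def test_rows_have_expected_width(self) -> None:
        rows = load_rows("1,2,3\n4,5,6")
        self.assertTrue(has_row_width(rows, 3))


if __name__ == "__main__":
    unittest.main()
```

Keeping the loader free of assertions means it can be reused on data of a different shape, while the expectations for a particular dataset stay in one testable place.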
