Book Review: Designing Machine Learning Systems
An important part of every data science professional’s growth is evolving from data preparation and model training to owning the entire machine learning development cycle: data ingestion, data cleaning and preparation, feature engineering, model training, model evaluation, model deployment, model serving, and model maintenance. This broader ownership requires improving both conceptual and applied understanding across the entire cycle. Perhaps the most important realization along the way is that model training is often the easiest step in the process. Designing Machine Learning Systems by Chip Huyen provides an overview of the machine learning development cycle through a conceptual lens.
Who should read this book?
Because this book focuses on concepts rather than specific technologies, I expect it to have more longevity in terms of its usefulness. It can therefore serve as a lasting reference for a wide range of tech professionals, even those who won’t understand every page on day one. That said, the conceptual side of machine learning development can be less approachable for the data science novice. I recommend that prospective readers have a strong understanding of the following before reading this book:
- Statistical distributions and how to find the range, median, mean, and mode of a data set
- How to train a simple machine learning model in any language (e.g. Python, R)
- A conceptual understanding of several machine learning algorithms (e.g. linear regression, decision trees, logistic regression)
- Experience cleaning and manipulating messy data with a programming language (e.g. Python, R, SQL)
- Software engineering fundamentals (i.e. work experience in programming or a few CS courses)
- The difference between low-quality and high-quality data
If you cannot confidently say that you meet the above qualifications, I would recommend starting with a less advanced book. The appropriate audience ranges from recent STEM graduates to experienced tech professionals looking to learn more about machine learning development; the latter will likely enjoy the book more, since the concepts are harder to understand without some real-world experience in data science.
Which Topics Are Not Covered In This Book?
Some topics of interest for machine learning enthusiasts that are outside of the scope of this book include:
- Deep dives into specific machine learning algorithms (e.g. decision trees, neural networks, logistic regression, etc.). If you are interested in this, I would recommend reading my review of The Hundred Page Machine Learning Book.
- Applying machine learning concepts with programming (e.g. python, R, etc.)
- Feature engineering and data preparation examples using programming (e.g. python, spark, etc.)
Key Lessons From This Book
This book provides many invaluable lessons that can be incorporated into any machine learning development process. Because the book is more conceptual, the reader must decide the best way to apply these lessons. Here are some of the most important ideas that I found while reading this book:
Data Leakage
“Data leakage refers to the phenomenon when a form of the label ‘leaks’ into the set of features used for making predictions, and this same information is not available during inference.” This would be equivalent to trying to guess how many fingers someone is holding up behind their back when there is a mirror reflecting the answer back to you. The mirror reveals information that should not be available if the goal is to measure how accurate a forecaster you are.
Before reading this book, I was not aware of how frequently data leakage can occur during the model development life cycle. One reason for its frequency is that data leakage is not always obvious. From personal experience, I recently fell victim to data leakage when I oversampled my imbalanced dataset prior to splitting the data. Some of the replicated records ended up in both the training AND test sets, so I had to adjust my code to split the data much earlier in the process.
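To make the fix concrete, here is a minimal sketch of the correct ordering with a toy dataset: split first, then oversample only the training set, so no replicated row can ever leak into the test set. The dataset and the naive random-oversampling step are illustrative, not from the book.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Toy imbalanced dataset: 90 negatives, 10 positives.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Correct order: split FIRST, stratifying to keep class ratios comparable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Then oversample the minority class on the TRAINING set only.
minority_idx = np.where(y_train == 1)[0]
n_needed = int((y_train == 0).sum()) - len(minority_idx)
extra = rng.choice(minority_idx, size=n_needed, replace=True)
X_train_bal = np.vstack([X_train, X_train[extra]])
y_train_bal = np.concatenate([y_train, y_train[extra]])
```

Had the oversampling happened before `train_test_split`, the duplicated minority rows could land on both sides of the split, inflating test metrics.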
Model Development
Machine learning model development should be a series of phases, where you layer on complexity over time. Because machine learning is inherently complex, the first phase should be trying to solve the business problem via simple heuristics rather than developing a model. If simple heuristics are insufficient, then you should try developing and deploying a simple model. Going end-to-end early provides more visibility into the process and makes it easier to identify bugs as you add more complexity. If a simple model does not meet your expectations, try optimizing the simple model with different objective functions, hyperparameter tuning, etc. If that fails to meet your expectations, try more complex models (e.g. neural networks).
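The phases above can be sketched in a few lines. This is my own illustrative example on synthetic data, not the book’s: phase one is a one-feature heuristic, and only if that falls short do you move to a simple model like logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data where feature 0 loosely drives the label.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1: a simple heuristic -- threshold a single feature, no model at all.
heuristic_preds = (X_test[:, 0] > 0).astype(int)
heuristic_acc = (heuristic_preds == y_test).mean()

# Phase 2: only if the heuristic is insufficient, fit a simple model.
model = LogisticRegression().fit(X_train, y_train)
model_acc = model.score(X_test, y_test)
```

The heuristic gives you a deployable baseline on day one; every later phase then has a concrete number to beat.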
While working on a recent project, the most difficult part of the development was deployment. Even once I was able to deploy the model successfully, it uncovered several additional roadblocks outside of the scope of my responsibilities. If I had instead prioritized end-to-end progress over model optimization, our team likely would have uncovered these roadblocks much sooner. The overall time until completion for the new feature might have been reduced.
Model Tracking and Storage
Because model development is an iterative process, it is essential to track information about each model in order to compare performance between experiments, reproduce models, and support maintenance. Information you might consider tracking includes:
- Model definition – information needed to shape the model (e.g. loss function)
- Model parameters – the actual values of your model’s parameters
- Featurize and predict functions – given a prediction request, how to extract features and feed them into your model to get back a prediction
- Dependencies – Python version, packages
- Data – the data used to train the model
- Model generation code – frameworks used, how it was trained, etc.
- Experiment artifacts – artifacts generated during the model development process
- Tags – help with model discovery (e.g. owner)
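One lightweight way to start is collecting these fields into a single JSON record per training run. The schema and the `make_model_card` helper below are my own hypothetical sketch, not an API from the book; dedicated tools like MLflow or Azure Machine Learning define richer versions of the same idea.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def make_model_card(model_name, params, train_data_path, tags):
    """Bundle the tracking fields above into one JSON-serializable record.
    Illustrative schema only -- adapt the fields to your own process."""
    return {
        "model_name": model_name,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "model_definition": {"loss": "log_loss", "algorithm": "logistic_regression"},
        "parameters": params,
        "dependencies": {"python": platform.python_version()},
        "data": {
            "path": train_data_path,
            # Hashing the path is a stand-in; hashing the data itself is stronger.
            "fingerprint": hashlib.sha256(train_data_path.encode()).hexdigest()[:12],
        },
        "tags": tags,
    }

card = make_model_card(
    "churn-clf-v1", {"C": 1.0}, "s3://bucket/train.parquet", {"owner": "data-team"}
)
record = json.dumps(card, indent=2)  # ready to log alongside the model artifact
```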
Even with all of this information tracked, it can be difficult to reproduce any specific model due to randomness inherent in the process. However, any information that you believe will be useful for maintenance, debugging, team coordination, or model reproducibility should at least be under consideration for tracking.
While tracking is extremely beneficial, it can slow down experimentation, so I recommend having a plan to automate the process as much as possible. On a recent project, Azure Machine Learning helped me track the package versions in my production environment, which was key for debugging a version-related scikit-learn bug.
Conclusion
Designing Machine Learning Systems is a fantastic addition to any data science professional’s library. Chip Huyen zooms out on each step in the machine learning development life cycle by focusing on concepts rather than specific implementations. After reading this book, you will have new frameworks to help you apply best practices throughout the entire machine learning development life cycle. For many of the concepts, Chip provides additional resources for further exploration. Just remember — the best practices she discusses must be combined with knowledge outside of the scope of this book in order to apply them to a specific model implementation.
~ The Data Generalist
Data Science Career Advisor
PS: Chip Huyen is teaching ML Systems Design and Strategy through the Sphere platform in an upcoming cohort. If you use my referral link, you will get $100 in credits for the next Sphere class you sign up for.
Source: Designing Machine Learning Systems by Chip Huyen