Machine Learning Workflow: Practical Steps to Building an AI Model

In this digital age, machine learning has become one of the most fascinating topics to discuss. From sophisticated applications that help us in our daily lives to complex algorithms that drive innovation across various industries, the world of artificial intelligence (AI) offers a wealth of potential.

However, behind every successful AI model lies a structured series of steps known as a machine learning workflow. You've probably heard this term before, but how can we ensure we're following the correct process to build an effective model?

With so much information circulating on the internet, it's crucial to have a clear workflow to avoid getting lost in the multitude of concepts.

This workflow consists of various interrelated stages, from defining the problem to maintaining the model after deployment. Understanding each of these steps not only helps you build better models but also provides a comprehensive overview of the data journey from start to finish.

So, are you ready to explore the practical steps to building a sophisticated AI model? In this blog, we'll cover each stage of the machine learning workflow in an easy-to-understand manner. Don't forget, if you have any experiences or questions, share them in the comments section. Let's start this journey together!

Introduction



Did you know that every time we travel, we usually plan a route, whether with a traditional map or a digital one? Interestingly, once we're familiar with a particular route, we can even tell others which way to go. Cool, right?

Now, let's talk about the function of maps. In the context of travel, maps are both important and symbolic. As guides and navigation tools, maps provide clear directions and help us plan the best route to reach our destination efficiently and safely.

This also applies when building machine learning projects. We need clear guidance to design the best model. This is where a machine learning workflow comes in. This workflow contains the steps that must be completed before our project can be implemented in the real world.

Aurélien Géron, in his book "Hands-On Machine Learning with Scikit-Learn & TensorFlow," explains that a machine learning workflow consists of systematic steps that start from defining the problem to implementing the model. If we summarize, the results can be depicted in a diagram that describes the process.


The machine learning workflow is systematic and iterative, meaning each step needs to be evaluated and adjusted to ensure the resulting model is accurate and reliable in a production environment. This book also emphasizes the importance of understanding the data and the problem at hand, as well as using the right tools and techniques to achieve our machine learning goals.

Looking at the workflow as a whole, you might think there are a lot of steps involved, right? But don't worry! We'll cover all these steps from scratch, hoping to build foundational knowledge that will be invaluable in your future projects.


Data Collection


Data collection is a crucial first step in any machine learning project. The right data will determine the quality of the model we build. Let's take a closer look at how to collect data and the various sources you can utilize.

1. Data Sources


You can collect data from a variety of sources. Here are some of the most common ones.



a. Internal Databases


If you work for a company, try checking internal databases that contain important information, such as customer or transaction data. You can retrieve this data directly from a database management system (DBMS) such as MySQL or PostgreSQL, using SQL queries to pull exactly the records you need.
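For illustration, here is a minimal sketch of pulling a table into Python with an SQL query. The connection string, table, and column names are hypothetical, and it assumes pandas, SQLAlchemy, and a PostgreSQL driver such as psycopg2 are installed.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical PostgreSQL connection string; replace with your own credentials.
engine = create_engine("postgresql://user:password@localhost:5432/sales_db")

# Select only the columns you need; filtering in SQL keeps the transfer small.
query = """
    SELECT customer_id, order_date, total_amount
    FROM transactions
    WHERE order_date >= '2024-01-01'
"""

df = pd.read_sql(query, engine)
print(df.head())
```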

b. Open Source Data Platforms


Many platforms, like Kaggle, provide ready-to-use datasets for analysis and machine learning projects. On Kaggle, you can find a variety of data from various fields—from health and weather to finance. Don't forget to explore the competitions there; you can learn a lot from the projects others have worked on!

c. API (Application Programming Interface)


Many platforms provide APIs for easier data access. For example, if you want to analyze sentiment on Twitter, you can use the Twitter API. This requires a little programming knowledge and authentication using API keys. Interested in trying it out?

d. Social Media


Platforms like Instagram and Facebook are goldmines for public opinion. Using scraping tools or APIs, you can collect user comments and interactions.

e. Websites and Discussion Forums


Data from review sites like Yelp or forums like Reddit can also be invaluable. Using web scraping techniques, you can gather a variety of user opinions about products and services. Don't forget to always check the site's policies!

f. Surveys and Questionnaires


If you feel like your data isn't enough, conducting a survey or questionnaire can be a solution. Tools like Google Forms or SurveyMonkey make it easy to create and distribute questionnaires. Have you tried them?

2. How to Extract Data


After finding a data source, the next step is to extract the data. Here are some methods you can try.

a. Using Automated Scripts


If the data is available on a website or API, you can create a script using Python. For example, use the requests library to call the API and pandas to store the data in a structured format. Want to learn how to create a script?
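As a sketch, the snippet below calls a hypothetical JSON API with an API key and stores the response in a pandas DataFrame; the URL, authentication header, and response fields are all assumptions.

```python
import requests
import pandas as pd

# Hypothetical endpoint and API key; real APIs document their own auth scheme.
url = "https://api.example.com/v1/reviews"
headers = {"Authorization": "Bearer YOUR_API_KEY"}

response = requests.get(url, headers=headers, params={"limit": 100}, timeout=30)
response.raise_for_status()  # stop early if the request failed

# Assume the API returns a JSON list of records; normalize it into a table.
records = response.json()
df = pd.json_normalize(records)

df.to_csv("reviews_raw.csv", index=False)
print(df.shape)
```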

b. Web Scraping


If a site doesn't have an API, you can use web scraping. Be sure to check the site's scraping permissions. With a library like BeautifulSoup, you can easily extract information from web pages.
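Here is a minimal scraping sketch with requests and BeautifulSoup. The URL and the CSS class it looks for are hypothetical, and you should confirm that the site's robots.txt and terms of service allow scraping before running anything like it.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page; check robots.txt and the site's terms before scraping.
url = "https://example.com/product-reviews"
html = requests.get(url, timeout=30).text

soup = BeautifulSoup(html, "html.parser")

# Extract the text of every element with a (hypothetical) "review-text" class.
reviews = [tag.get_text(strip=True) for tag in soup.find_all(class_="review-text")]
print(f"Collected {len(reviews)} reviews")
```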

c. Data Extraction Tools


If you don't want to bother with programming, there are tools like Octoparse or ParseHub that can help you extract data. These tools are suitable for those of you who prefer visual interfaces.

d. Ensuring Data Quality


Collecting data alone isn't enough; you also need to ensure its quality. Here are some tips for ensuring data quality, with a short pandas sketch after the list.
  • Representative: Make sure the data represents the population you want to analyze. For example, if your analysis is for teenagers, collect data only from teenagers.
  • Clean from Duplication: Check and remove duplicate data to avoid bias. Try using functions in Pandas to check this!
  • Consistency: Make sure the data format is consistent. For example, if collecting date data, all data should be in the same format (such as YYYY-MM-DD).
  • Quality Testing: Check data quality with visualizations or descriptive statistics to find outliers.
  • Sufficient Data: Make sure the amount of data is sufficient for a reliable model. If limited, try data augmentation techniques or find more sources.
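As an illustration, a few of these checks might look like the following pandas sketch; the file and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("reviews_raw.csv")  # any tabular dataset works here

# Duplication: count and drop exact duplicate rows.
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# Consistency: parse dates into a single YYYY-MM-DD format.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce").dt.strftime("%Y-%m-%d")

# Quality testing: descriptive statistics make extreme values easy to spot.
print(df.describe(include="all"))
```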

With the right approach to data collection, you can ensure the foundation of your machine learning project is strong and ready to move on to the next stage.

Exploratory Data Analysis


After data collection, the next, equally important, stage is Exploratory Data Analysis (EDA). EDA is the process of exploring cleaned data to gain insights and answer analytical questions. In this process, you will use various descriptive statistical techniques to discover patterns and relationships within the data. Furthermore, data visualization is often used to help understand these patterns.

Many data practitioners consider EDA to be an exciting stage in data analysis. Here, you can experiment with the data to discover valuable insights, identify anomalies, and formulate conclusions from the analysis results.

As an aspiring engineer, you will surely enjoy this EDA stage, as it is crucial and cannot be skipped. Before that, let's understand the difference between Exploratory Data Analysis (EDA) and Explanatory Data Analysis (ExDA).

The Difference Between EDA and ExDA


Exploratory Data Analysis (EDA) is the process of understanding the structure and patterns in data. Its primary goal is to discover hidden insights and relationships between variables without the need for an initial hypothesis.

In EDA, you'll use descriptive statistics techniques, such as calculating the mean and median, and create data visualizations, such as histograms and box plots. This process is flexible and iterative, allowing you to try different approaches to gain deeper insights. Typically, EDA is conducted by data analysts or data scientists who interact directly with the data.
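For example, a quick first pass at EDA with pandas and matplotlib might look like the sketch below; the file and column names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")  # hypothetical dataset

# Descriptive statistics: mean, median, spread, and counts in one view.
print(df.describe())
print("Median price:", df["price"].median())

# Histogram to inspect the shape of a numeric feature.
df["price"].plot(kind="hist", bins=30, title="Price distribution")
plt.show()

# Box plot to compare a numeric feature across a categorical one.
df.boxplot(column="price", by="region")
plt.show()
```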

Meanwhile, Explanatory Data Analysis (ExDA) aims to communicate the findings of data analysis to a broader audience, such as management or clients. Here, the focus is on conveying information clearly and convincingly.

ExDA uses purposeful visualizations and a strong narrative to support the conclusions drawn. The ExDA methodology is structured, starting with a problem statement and reaching clear conclusions. The audience for ExDA is typically stakeholders and decision-makers who need information to help them make informed business decisions.

For example, in a project aimed at identifying factors influencing sales, EDA is used to analyze the relationship between variables, such as price and promotion timing. During this process, you may discover unexpected patterns or anomalies in the data that might not have been previously apparent. The results of EDA typically include new insights, testable hypotheses, and a deeper understanding of the analyzed data.

After identifying key findings through EDA, the next stage is ExDA. Here, you will create a report explaining the factors influencing sales to management. This report will include simple graphs and clear narratives, so management can easily understand the information and make informed decisions based on the analysis. The output of ExDA typically takes the form of a final report, presentation, or dashboard that conveys the analysis results in an informative and easy-to-understand manner to the audience.

Data Preprocessing


Data preprocessing in machine learning can be likened to preparing ingredients before cooking. When you want to cook a dish, you can't just throw all the raw ingredients into the pot. You need to wash, chop, and process the ingredients to make them ready to serve.

Similarly, in data preprocessing, we need to clean and prepare the raw data—such as addressing missing data, handling outliers, and changing the data format—so that the data is ready for use by the machine learning model, just as prepared ingredients are ready to be cooked and served.

This process involves various techniques and transformations to ensure that the data used is high quality, consistent, and suitable for the analysis or modeling objectives. In other words, this step transforms and modifies the data features into a form that is easier for machine learning algorithms to understand and process.

The following are the steps typically taken in data preprocessing.

1. Handling Missing Data


Missing data, or missing values, is a common problem in datasets. You can solve this problem by deleting the missing data or using imputation, which replaces missing values with other values, such as the mean, median, or mode.
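In pandas, both options might look like this minimal sketch; the file and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical dataset

# Option 1: drop rows that contain any missing values.
df_dropped = df.dropna()

# Option 2: impute missing values with a statistic of the column.
df["price"] = df["price"].fillna(df["price"].median())      # numeric: median
df["region"] = df["region"].fillna(df["region"].mode()[0])  # categorical: mode
```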

2. Addressing Outliers


Outliers are values that differ significantly from the majority of the data and can hurt model performance. Common ways to address outliers include removing them or transforming the data so that values more closely approximate a normal distribution.
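As one illustration (not the only approach), the interquartile range rule can flag outliers, which can then be removed as described above; the column name below is hypothetical.

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical dataset

# Interquartile range (IQR) rule: flag values far outside the middle 50%.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df["price"] < lower) | (df["price"] > upper)]
print(f"Found {len(outliers)} outliers")

# One option from the text: simply remove them.
df_clean = df[(df["price"] >= lower) & (df["price"] <= upper)]
```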

3. Data Transformation


This step ensures the data is in a format suitable for the model. Common transformation techniques, illustrated in the sketch after this list, include:

  • Normalization: Changing the scale of data so that it falls within a specific range, usually between 0 and 1.
  • Standardization: Transforming data so that it has a distribution with a mean of 0 and a standard deviation of 1, often used for Gaussian-distributed data.
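With scikit-learn, the two techniques might look like the sketch below, applied to a small hypothetical feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0, 200.0], [15.0, 400.0], [20.0, 800.0]])  # hypothetical features

# Normalization: rescale each feature to the 0-1 range.
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: rescale each feature to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

print(X_norm)
print(X_std)
```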

4. Categorical Variable Encoding


Because most machine learning models can only work with numeric data, categorical variables need to be converted into numeric form. Some commonly used techniques are listed below, with a short sketch after the list.
  • Label Encoding: converts categories into numeric labels, suitable for ordinal variables.
  • One-Hot Encoding: converts each category into a separate binary column (0 or 1), for categories that do not have an ordinal relationship.
  • Ordinal Encoding: converts categories into numeric labels based on order or rank.
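As a minimal sketch, here is how these three encodings might look with pandas and scikit-learn; the example columns are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "city": ["Jakarta", "Bandung", "Jakarta"],  # no natural order
    "size": ["small", "large", "medium"],       # has a natural order
})

# One-Hot Encoding: one binary column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Label Encoding: an arbitrary integer per category.
df["city_label"] = LabelEncoder().fit_transform(df["city"])

# Ordinal Encoding: integers that respect the stated category order.
order = [["small", "medium", "large"]]
df["size_ordinal"] = OrdinalEncoder(categories=order).fit_transform(df[["size"]]).ravel()

print(pd.concat([df, one_hot], axis=1))
```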

By following the steps above, you will ensure that the data used for the machine learning model is high-quality and ready for analysis.

Model Training


Model development is the stage of creating, training, and testing a machine learning model to produce accurate predictions. This stage is crucial because the quality of the model built will affect how well it performs in the real world. Below is a more detailed explanation of the model development process.

1. Data Split (Train-Test Split)


Before training a model, it is crucial to split the dataset into several parts so that we can objectively evaluate the model's performance. The main goal is to ensure that the model not only performs well on known data (training data) but also predicts well on new data.

This splitting is typically done as follows.


2. Training Set (Training Data)


The largest portion of the dataset is used to train the model and discover patterns. At this stage, the model "learns" from the data and looks for relationships between input and output variables. Typically, 70-80% of the dataset is allocated for training data.

3. Test Set (Test Data)


Around 20-30% of the dataset is allocated as test data. After the model is trained, the test data is used to evaluate whether the model can provide accurate predictions on new, previously unseen data. This evaluation helps determine how well the model will perform in the real world.

4. Validation Set (Optional)


The validation set is used when we want to perform hyperparameter tuning (adjusting model parameters). With the validation set, you can test parameter changes without using the test data, so the test data remains "pure" for the final evaluation. Typically, the validation set is taken from the training data, for example, 10-20% of the training data.

The following is an example of how a dataset of 1,200 samples might be split.

Without a validation set:
  • 900 samples → Training set
  • 300 samples → Test set

With a validation set:
  • 700 samples → Training set
  • 200 samples → Validation set
  • 300 samples → Test set
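With scikit-learn, the 1,200-sample scheme above could be reproduced like this (a sketch using a randomly generated placeholder dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 1,200 samples with 5 features and a binary label.
X = np.random.rand(1200, 5)
y = np.random.randint(0, 2, size=1200)

# Carve out the 300-sample test set first.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=300, random_state=42)

# Optionally carve a 200-sample validation set out of the remaining 900.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=200, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 200 300
```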

Choosing a Supervised or Unsupervised Learning Algorithm


At this stage, you choose the type of machine learning algorithm based on the type of problem and the data you have. There are two main approaches: supervised learning and unsupervised learning.


1. Supervised Learning


Supervised learning is used when the dataset has clear labels or targets. This means that each training example pairs input features (e.g., house area) with an output or label (e.g., house price). The model learns by mapping the relationship between input and output, allowing it to make predictions on new data with similar patterns.

Supervised Learning Case Study
  • Classification: Classification is a method in machine learning used to group data into specific categories or labels. In classification, a model is trained to predict the class of input data based on existing features, with the goal of identifying the most appropriate category. For example, a model is created to determine whether an email is spam or not spam based on features such as specific words, the sender, or links in the email.
  • Regression: Regression is a method in machine learning used to predict continuous values based on input variables. In regression, the model is trained to estimate the relationship between features and target values, thereby generating numerical predictions. For example, a model can be used to predict house prices based on features such as house area, number of rooms, and location. Here, house price is the continuous target variable.

2. Popular Supervised Learning Algorithms


Classification
  • K-Nearest Neighbors (KNN): compares new data with the closest data in the dataset.
  • Decision Tree: creates a decision tree based on features to predict categories.
  • Random Forest: combines multiple decision trees to obtain more accurate predictions.

Regression
  • Linear Regression: finds the best straight line to predict continuous values.
  • Ridge Regression: a version of linear regression with regularization to avoid overfitting.
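As a sketch, training one classifier and one regressor from the lists above with scikit-learn might look like this, using scikit-learn's built-in toy datasets rather than real project data.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Classification with Random Forest on a built-in labeled dataset.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
print("Classification accuracy:", clf.score(X_te, y_te))

# Regression with Linear Regression on a built-in continuous target.
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
reg = LinearRegression().fit(X_tr, y_tr)
print("Regression R^2:", reg.score(X_te, y_te))
```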

3. Unsupervised Learning


Unsupervised learning is applied to datasets without labels or targets. The goal is to discover hidden patterns, groups, or structures in the data. This model attempts to interpret the data without specific information about categories or output values.

Unsupervised Learning Case Examples
  • Clustering: Used to divide data into groups or segments. For example, companies can group customers based on their shopping patterns to develop more appropriate marketing strategies.
  • Dimensionality Reduction: Used to reduce the number of features in very large datasets while retaining important information. For example, the PCA (Principal Component Analysis) technique transforms existing features into new components for more efficient analysis.

Popular Unsupervised Learning Algorithms
  • K-Means: This algorithm finds the center points (centroids) and groups the data into clusters based on the closest distance to a centroid.
  • DBSCAN: This algorithm groups data based on the density of data points. It is suitable when the dataset contains outliers, because outliers will not be assigned to any cluster.
  • Hierarchical Clustering: This algorithm forms a hierarchical structure by repeatedly merging the most similar data into clusters, until all the data is combined into one large cluster.

4. Model Training


At this stage, the model begins to "learn" from the training data. The training process differs between supervised and unsupervised learning, but in essence the model tries to find the best pattern or function that connects features with outputs (for supervised models) or discovers hidden structures in the data (for unsupervised models).
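For illustration, here is a minimal K-Means training sketch on synthetic data generated with scikit-learn.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data with three hidden groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# K-Means assigns each point to the nearest of k centroids.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # learned centroids
print(labels[:10])              # cluster assigned to the first 10 points
```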

Model Evaluation


We have now reached the model evaluation stage. In this section, we will check whether the machine learning model we created is working well.

Imagine you're preparing a cake for your friends—even if you've followed the recipe correctly, you still need to taste the final product, right? The same is true with machine learning models. We need to ensure that the model not only "looks good" but also produces accurate and consistent results.

This process is crucial to ensure that the model performs well not only on the training data but also on new data. A model that looks great on training data may not perform as well in the real world because it may be too familiar with the existing data (aka overfitting).

Model Evaluation: Supervised vs. Unsupervised Learning


1. Supervised Learning Evaluation


In supervised learning, we measure how accurately a model predicts the correct labels using the test data. Some commonly used metrics include:

  • Accuracy: the percentage of correct predictions.
  • Precision & Recall: Precision measures the accuracy of positive results, while Recall measures how many positive cases are detected.
  • F1-Score: the harmonic mean of precision and recall to balance the two.
  • Confusion Matrix: displays the number of correct and incorrect predictions in tabular form (True/False Positives and Negatives).

Examples: predicting whether an email is spam or not (classification) and predicting a continuous value such as a house price (regression).
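With scikit-learn, these metrics can be computed from test-set predictions, for example as in the sketch below, which uses placeholder labels for a hypothetical spam classifier.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical true labels and model predictions (1 = spam, 0 = not spam).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```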

2. Unsupervised Learning Evaluation


Because there are no labels, evaluation in unsupervised learning focuses more on the structure and patterns in the data. Some common metrics are:
  • Silhouette Score: measures how similar each object is to its own cluster compared with other clusters; values closer to 1 indicate better-separated clusters.
  • Inertia: the total squared distance between data points and their cluster centers; the smaller, the better.
  • Davies-Bouldin Index: the smaller the value, the better the clustering quality.

Example: grouping customers based on shopping habits.
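A minimal sketch of these clustering metrics with scikit-learn, using synthetic data in place of real customer records:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic customer-like data with no labels.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

print("Silhouette score    :", silhouette_score(X, labels))
print("Inertia             :", kmeans.inertia_)
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))
```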

With proper evaluation, you can ensure the model performs optimally, both for prediction (supervised) and pattern discovery (unsupervised).

Deployment


Deployment, or model implementation, is the stage where a trained and tested machine learning model is deployed in the real world for use. This model can be integrated into a web application, mobile application, API, or IoT device. Once deployed, it's important to regularly monitor its performance, as data can change over time.


If model accuracy decreases, retraining with new data is necessary. In addition, optimizations such as model compression and load balancing help ensure the model runs quickly and efficiently, especially when handling many users. Equally important, data security and privacy must be considered so that the deployed model remains secure and compliant with regulations.
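As one illustration (certainly not the only deployment option), a trained model saved with joblib could be exposed through a small FastAPI service; the model file name and feature names here are hypothetical.

```python
# serve_model.py -- run with: uvicorn serve_model:app
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical model saved earlier with joblib.dump


class HouseFeatures(BaseModel):
    area: float
    rooms: int


@app.post("/predict")
def predict(features: HouseFeatures):
    # The (assumed scikit-learn) model expects a 2-D array of feature rows.
    prediction = model.predict([[features.area, features.rooms]])
    return {"predicted_price": float(prediction[0])}
```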

Monitoring


Monitoring is the final, crucial stage after implementing a machine learning model. It requires you to actively monitor the model's performance in a real-world environment. This stage aims to ensure the model is functioning properly and producing accurate results over time.


Some things to consider in the monitoring process include the following:

1. Model Performance


Periodically evaluate model performance metrics such as accuracy, precision, recall, or F1-score, depending on the type of problem being solved. This helps detect if the model is starting to lose accuracy.

2. Data Drift


Pay attention to changes in input data that may occur over time, known as data drift. Data drift can affect the model's ability to make accurate predictions. For example, if consumer habits change, a previously accurate model may start producing irrelevant results.
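A simple illustration of a drift check is to compare the recent distribution of a numeric feature against its training-time distribution, for example with a two-sample Kolmogorov-Smirnov test; the arrays below are placeholders.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder distributions: the feature as seen at training time vs. in production.
train_feature = np.random.normal(loc=50, scale=10, size=1000)
live_feature = np.random.normal(loc=58, scale=12, size=1000)  # shifted: drift

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:
    print(f"Possible data drift detected (KS statistic={statistic:.3f})")
else:
    print("No significant drift detected")
```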

3. Feedback Loop


Gather feedback from users and the system to continuously improve the model. This feedback can help refine features or even further develop the model.

4. Maintenance


Prepare a maintenance plan to retrain the model with new data to maintain its relevance and accuracy. This process can involve refreshing data, fine-tuning parameters, or replacing the model entirely if necessary.

With proper monitoring, you can ensure that your machine learning model remains effective, responsive to changes, and continues to add value to decision-making.


Conclusion


In conclusion, all stages of the machine learning (ML) workflow are crucial and interconnected. From preparing quality data, selecting the right model, building the model through training and testing, to evaluating and monitoring its performance in the real world, each step has its own role. Are you ready to delve deeper and apply this knowledge to your projects?