Last Updated: April 2023
In machine learning, we use data to make predictions or classifications. For example, we might use data about houses (like the number of bedrooms, the size of the backyard, and the location) to predict the price of the house.
However, not all of the data is equally important for making the prediction. Each measurable piece of input data is called a "feature", and some features are more relevant than others. In the house price prediction example, the number of bedrooms, the size of the backyard, and the location are all features.
Feature engineering is the process of figuring out which features are the most important for making accurate predictions, and then transforming them into a format that a machine learning algorithm can use. It's kind of like picking out the most important pieces of a puzzle and putting them together in a way that makes sense.
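As a minimal sketch of that transformation step (the column names and scaling choices here are hypothetical), feature engineering on the house example might turn raw records into a list of numbers a model can consume:

```python
# Hypothetical raw records for the house-price example.
houses = [
    {"bedrooms": 3, "backyard_sqft": 500, "location": "suburb"},
    {"bedrooms": 2, "backyard_sqft": 0, "location": "city"},
]

def engineer_features(record):
    """Transform one raw record into numeric features a model can use."""
    # One-hot encode the categorical location feature.
    locations = ["city", "suburb", "rural"]
    one_hot = [1.0 if record["location"] == loc else 0.0 for loc in locations]
    # Scale the backyard size down so it is on a similar range to bedrooms.
    return [float(record["bedrooms"]), record["backyard_sqft"] / 1000.0] + one_hot

features = [engineer_features(h) for h in houses]
print(features)  # [[3.0, 0.5, 0.0, 1.0, 0.0], [2.0, 0.0, 1.0, 0.0, 0.0]]
```

One-hot encoding and rescaling are two of the most common transformations, because many algorithms expect purely numeric inputs on comparable scales.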
As another example, a dataset of credit card transactions may contain features such as transaction ID, date/time, amount, merchant category code, cardholder name, card type, card number, billing zip code, authorization code, and fraudulent transaction indicator.
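A short sketch of how new features might be derived from such a transaction record (the field names and thresholds are hypothetical, chosen only for illustration):

```python
import math
from datetime import datetime

# One hypothetical credit card transaction record.
txn = {
    "transaction_id": "T1001",
    "timestamp": "2023-04-01T23:45:00",
    "amount": 250.0,
    "merchant_category_code": "5411",
}

def derive_features(txn):
    """Turn raw transaction fields into features a fraud model could use."""
    ts = datetime.fromisoformat(txn["timestamp"])
    return {
        "hour_of_day": ts.hour,           # time of day, not the raw timestamp
        "is_weekend": ts.weekday() >= 5,  # Saturday=5, Sunday=6
        "log_amount": math.log1p(txn["amount"]),  # compress the long tail of amounts
        "is_grocery": txn["merchant_category_code"] == "5411",
    }

print(derive_features(txn))
```

Note that raw fields like the transaction ID or card number would typically be dropped, while the date/time and amount are transformed into forms (hour of day, log-scaled amount) that expose patterns a model can learn from.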
The above examples all show features that humans can understand. However, feature engineering can also produce features that are not easily interpretable, either through complex mathematical transformations or through combinations of existing features (for example, multiplying the number of bedrooms by the size of the backyard). Machine learning algorithms such as neural networks can likewise learn abstract representations of the data that have no obvious human interpretation. While these features can improve model performance, it is important to balance complexity with interpretability, especially in sensitive applications where understanding the model's decisions matters.
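The combined-feature idea mentioned above can be sketched in a few lines (again with hypothetical field names); the product of two raw features has no obvious physical meaning, but it lets a simple linear model capture their joint effect:

```python
def add_interaction(record):
    """Create an interaction feature by multiplying two raw features.

    The resulting number is not directly interpretable, but it allows
    a linear model to capture the combined effect of both inputs.
    """
    record = dict(record)  # copy so the original record is unchanged
    record["bedrooms_x_backyard"] = record["bedrooms"] * record["backyard_sqft"]
    return record

house = {"bedrooms": 3, "backyard_sqft": 500}
print(add_interaction(house)["bedrooms_x_backyard"])  # 1500
```

This is the trade-off in miniature: the interaction feature may boost accuracy, but explaining what "bedrooms times backyard square footage" means to a stakeholder is much harder than explaining either feature alone.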