Data Preparation
Machine learning (ML) models rely on data as input. In ML, data are divided into two major categories: Structured and Unstructured. The former includes Numerical and Categorical data, and the latter comprises Images, Audio, Video, and Text. Categorical data, a type of structured data, represent qualitative information divided into distinct groups or labels. They can be further classified into two subtypes: Nominal data (e.g., gender) and Ordinal data (e.g., rating scales) [1].
Feature Engineering
Each record in a dataset is called a data point (or sample). Each data point is composed of fundamental components, called features. The number of features and the relationships among them play an important role in the performance of an ML model. Therefore, feature engineering is a crucial step in preparing the data before they are fed to the model.
The main focus of the current document is on structured data and image data; therefore, we review feature engineering techniques relevant to these data types. Some techniques apply to both structured and image data, while others are specific to images; the image-specific methods are explicitly highlighted in the corresponding subsections.
Handling Missing Values
Missing values are a common challenge in ML datasets: for some data points, the values of one or more features are absent. There are different methods to cope with missing values [2].
Removing Rows with Missing Values
In this method, we remove all data points with missing values. Although this method is simple to implement and removes potentially problematic data points, it reduces the sample size and can introduce bias into the dataset if certain groups are more likely to have missing values [2].
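The following is a minimal sketch of this approach using pandas; the toy DataFrame is purely illustrative.

```python
import pandas as pd

# Toy data with missing entries (illustrative only).
df = pd.DataFrame({
    "age": [25, None, 47, 31],
    "income": [50_000, 62_000, None, 58_000],
})

# Drop every row that contains at least one missing value.
df_complete = df.dropna()

# Alternatively, drop rows only when specific columns are missing.
df_age_known = df.dropna(subset=["age"])
```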
Imputation
Using the imputation method, the missing values are replaced by estimated values. Two common ways to impute missing values are [2]:
- Mean, Median, and Mode Imputation: The missing values are replaced with the mean, median, or mode of the corresponding feature. This method is simple to implement; however, it might reduce the accuracy of predictions [2].
- Forward and Backward Fill: The missing values are filled with the nearest non-missing values from the same feature: forward fill replaces a missing value with the last observed non-missing value, while backward fill replaces it with the next observed non-missing value [2]. Both imputation styles are illustrated in the sketch below.
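A minimal pandas sketch of both imputation styles; the toy Series is purely illustrative.

```python
import pandas as pd

temps = pd.Series([21.0, None, 23.5, None, 22.0])  # illustrative data

# Mean / median / mode imputation: replace missing entries with a summary statistic.
mean_filled   = temps.fillna(temps.mean())
median_filled = temps.fillna(temps.median())
mode_filled   = temps.fillna(temps.mode()[0])

# Forward fill propagates the last observed value; backward fill uses the next one.
forward_filled  = temps.ffill()
backward_filled = temps.bfill()
```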
Interpolation
Rather than relying on measures such as the mean, median, or mode (as in simple imputation), interpolation estimates missing values by leveraging the relationships between neighboring data points. This method is more complex to implement and depends on assumptions, such as the existence of linear or quadratic relationships within the data. However, it often yields more accurate results than simple imputation and better preserves data integrity by capturing underlying patterns or trends. Two common interpolation techniques are [2]:
- Linear: Linear interpolation fits a straight line between the neighboring known values and estimates the missing value along that line [2].
- Quadratic: Quadratic interpolation assumes a quadratic relationship between a missing value and its surrounding known values and estimates the missing value accordingly [2]. Both techniques are shown in the sketch below.
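A minimal sketch with pandas; the quadratic variant relies on SciPy being installed.

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0, None, 9.0])  # illustrative data

# Linear interpolation: fill each gap along a straight line between known neighbors.
linear = s.interpolate(method="linear")

# Quadratic interpolation: assumes a quadratic relationship (requires SciPy).
quadratic = s.interpolate(method="quadratic")
```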
Resizing
To standardize the shape of the model input, all images must be resized to a fixed size, such as 28 × 28 pixels. This is an image-specific step.
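A minimal sketch with Pillow; the file name is only a placeholder.

```python
from PIL import Image

# Load an image (placeholder file name) and resize it to a fixed 28 x 28 shape.
img = Image.open("sample.png")
img_resized = img.resize((28, 28))
```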
Scaling
Features can have very different value scales, and features with larger values may unintentionally dominate the others. To prevent such issues during model training and evaluation, data are scaled to a specific range. Common scaling methods include normalization, standardization, and log scaling [3].
Normalization
Normalization, also known as min-max scaling, rescales the data of a feature into the range [0, 1] [3]:
\[ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}}, \]
where \(x_{min}\) and \(x_{max}\) are the minimum and the maximum values of the feature, respectively. It is worth mentioning that normalization does not alter the shape of the data distribution [3].
Standardization
Unlike normalization, which preserves the original distribution of the data, standardization, also known as Z-score normalization, transforms a feature so that it has a mean of 0 and a standard deviation of 1 [3]:
\[ x_{scaled} = \frac{x - \mu}{\sigma}, \]
where \(\mu\) and \(\sigma\) are the mean and standard deviation of the feature, respectively [3].
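A minimal sketch of both scaling methods, using scikit-learn as one possible implementation alongside the formulas above; the toy array is illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])  # a single toy feature

# Min-max normalization: rescale the feature into [0, 1].
x_minmax = MinMaxScaler().fit_transform(X)

# Standardization (z-score): zero mean, unit standard deviation.
x_standard = StandardScaler().fit_transform(X)

# The same transforms written out directly from the formulas above.
x_minmax_manual   = (X - X.min()) / (X.max() - X.min())
x_standard_manual = (X - X.mean()) / X.std()
```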
Log Scaling
Features can exhibit a power law distribution, where low values of \(x\) correspond to high values of \(y\), and \(y\) decreases rapidly as \(x\) increases. An example of this is movie ratings, where a few movies receive many ratings while most receive very few. Logarithmic scaling can help mitigate the effects of a power law distribution by transforming the data into a more balanced scale [3].
\[ x_{scaled} = \log(x) \]
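A minimal NumPy sketch; the log1p variant is an addition here (not from the text) that is commonly used when zero values can occur.

```python
import numpy as np

ratings_count = np.array([1, 3, 10, 250, 12000], dtype=float)  # illustrative counts

# Log scaling compresses the long tail of a power-law-like feature.
log_scaled = np.log(ratings_count)

# log(1 + x) avoids problems when a count of zero is possible.
log1p_scaled = np.log1p(ratings_count)
```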
Binning
When the overall linear relationship between a feature and the label is weak or nonexistent, or when feature values are clustered, traditional scaling methods may fail. Binning, also known as bucketing, provides an effective alternative by converting numerical data into categorical data. This method groups numerical subranges into bins or buckets, which can better represent features that exhibit clustered or “clumpy” distributions rather than linear patterns [3].
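A minimal pandas sketch of binning; the bin edges and labels are arbitrary examples.

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 80])  # illustrative data

# Fixed subranges ("bins") turn the numeric feature into categories.
age_groups = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 100],
    labels=["child", "young adult", "adult", "senior"],
)

# Quantile-based binning puts roughly the same number of samples in each bucket.
age_quartiles = pd.qcut(ages, q=4)
```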
Encoding
As mentioned earlier, categorical data represent qualitative information. Since ML models operate on numerical values, categorical data must be transformed into a numeric format through Encoding. Encoding converts categorical values into numerical representations and can be performed using methods such as One-Hot Encoding and Embedding Learning [3].
One-Hot Encoding
One-hot encoding assigns a binary vector to each category [3]. For example, if the feature is weekdays, each day can be encoded as:
| Weekday | One-hot vector |
|---|---|
| Monday | (1, 0, 0, 0, 0, 0, 0) |
| Tuesday | (0, 1, 0, 0, 0, 0, 0) |
| Wednesday | (0, 0, 1, 0, 0, 0, 0) |
| Thursday | (0, 0, 0, 1, 0, 0, 0) |
| Friday | (0, 0, 0, 0, 1, 0, 0) |
| Saturday | (0, 0, 0, 0, 0, 1, 0) |
| Sunday | (0, 0, 0, 0, 0, 0, 1) |
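The same encoding can be produced programmatically; a minimal sketch with pandas and scikit-learn (the column ordering of the resulting vectors may differ from the table above):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

days = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Sunday"]})

# pandas creates one binary column per observed category.
one_hot = pd.get_dummies(days, columns=["weekday"])

# scikit-learn equivalent; fit_transform returns a sparse matrix by default.
encoded = OneHotEncoder().fit_transform(days[["weekday"]]).toarray()
```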
Embedding Learning
Pre-trained models can provide embeddings, i.e., numerical vector representations, of input data. These models are particularly useful when capturing the semantic relationships between inputs is important.
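A minimal sketch using the sentence-transformers package as one possible source of pre-trained embeddings; the package and model name are assumptions, not prescribed by the text.

```python
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is one publicly available model; it is downloaded on first use.
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["The movie was great", "I loved this film"]

# Each text is mapped to a dense numerical vector (an embedding);
# semantically similar texts end up with similar vectors.
embeddings = model.encode(texts)
print(embeddings.shape)  # (2, embedding_dimension)
```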
Consistent Color Mode
In image processing, it is important for all images to have a consistent color mode. Common color modes include Grayscale, RGB (Red, Green, Blue), and CMYK (Cyan, Magenta, Yellow, Key/Black).
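A minimal Pillow sketch; the file name is only a placeholder.

```python
from PIL import Image

img = Image.open("sample.png")

# Convert every image to a single, consistent color mode before training.
img_rgb  = img.convert("RGB")  # three channels
img_gray = img.convert("L")    # grayscale, one channel
```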
Data Augmentation
Data augmentation is a technique to increase the size and diversity of data samples for training purposes. To this end, augmentation methods generate new data points from the existing data [4].
- Tabular Data: For tabular data, new samples can be generated through techniques such as adding random noise to existing values, performing feature permutation (swapping values within the same column), or creating synthetic data based on the mean and standard deviation of the original data [4].
- Images: For image data, new samples can be generated through augmentation techniques such as cropping, adjusting saturation, flipping (horizontally or vertically), and rotating the original images [4]. Both cases are illustrated in the sketch below.
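A minimal sketch of both cases, using NumPy for the tabular noise and torchvision as one possible library for the image transforms; the toy data and file name are placeholders.

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Tabular data: add small Gaussian noise to existing rows to create new samples.
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0]])
X_noisy = X + np.random.normal(loc=0.0, scale=0.01 * X.std(axis=0), size=X.shape)

# Image data: random crops, flips, rotations, and saturation changes.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(saturation=0.3),
    transforms.RandomCrop(24),
])
img = Image.open("sample.png").resize((28, 28))  # placeholder file name
augmented_img = augment(img)
```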
Data Balancing
If a labeled dataset is imbalanced, i.e., the number of data points varies significantly across categories, the model tends to learn patterns from the majority classes while underrepresenting the minority classes. Data balancing techniques, namely Upsampling and Downsampling, are used to address this issue and ensure fair learning across all categories [5].
Upsampling
Upsampling methods increase the number of samples in the minority class. Although upsampling increases the dataset size, it is vulnerable to data leakage, which can lead to model overfitting. Common upsampling techniques include random oversampling, the synthetic minority oversampling technique (SMOTE), the adaptive synthetic sampling approach (ADASYN), and data augmentation [6].
Random Oversampling
Random oversampling randomly selects data points from the minority class and duplicates them. Because it merely replicates existing samples, random oversampling is prone to model overfitting [6].
Synthetic Minority Oversampling Technique
SMOTE generates new samples for the minority class by interpolation. First, for each minority class data point, the algorithm identifies its \(K\) nearest neighbors (with \(K\) commonly set to 5). Then, one of these neighbors is randomly selected, and a new synthetic sample is created at a random point along the line segment connecting the original data point and the chosen neighbor in the feature space. This process is repeated with different neighbors as needed until the desired level of upsampling is achieved [6].
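A minimal NumPy sketch of the interpolation step described above, on toy minority-class points; \(K\) is reduced to 3 here only because the toy set is small.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy minority-class points in a 2-D feature space (illustrative only).
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.7]])
k = 3  # number of nearest neighbors

# +1 because each point is returned as its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)

rng = np.random.default_rng(0)
synthetic = []
for i, neighbors in enumerate(idx):
    j = rng.choice(neighbors[1:])          # pick one of the k neighbors at random
    lam = rng.random()                     # random position on the connecting segment
    synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
synthetic = np.array(synthetic)            # one new sample per minority point
```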
Adaptive Synthetic Sampling Approach
The ADASYN technique extends the idea of SMOTE by focusing on regions where the minority class is underrepresented. A \(K\)-nearest neighbor (KNN) model is first built on the entire dataset, and each minority class point is assigned a "hardness factor", defined as the ratio of majority class neighbors to the total number of neighbors \(K\). Similar to SMOTE, new synthetic samples are generated through linear interpolation between a minority data point and its neighbors. However, the number of samples generated is scaled by the hardness factor, so that more synthetic points are created in regions where minority data are sparse and fewer points are added in regions where they are already dense [6].
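In practice, these oversamplers are available in the imbalanced-learn package; a minimal usage sketch on a toy imbalanced dataset (the package choice and parameters are assumptions, not prescribed by the text):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# A toy two-class dataset with a 90% / 10% class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_ros, y_ros       = RandomOverSampler(random_state=0).fit_resample(X, y)
X_smote, y_smote   = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
X_adasyn, y_adasyn = ADASYN(random_state=0).fit_resample(X, y)
```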
Data Augmentation
Data augmentation can also be applied as a strategy for balancing datasets [6].
Downsampling
Downsampling methods reduce the number of samples in the majority class to match the size of the minority class. While this approach can lower the risk of model overfitting, it also increases the likelihood of underfitting and may introduce bias by discarding potentially useful data. Common downsampling techniques are random downsampling and near miss downsampling [7].
Random Downsampling
Similar to random oversampling in upsampling, random downsampling selects data points at random; however, in this case, the selected points come from the majority class and are removed [7].
Near Miss Downsampling
Near Miss Downsampling involves distance-based instance selection. In this method, the pairwise distance between all majority and minority class instances is first calculated. Based on these distances, majority class instances that are farther away from minority points are removed. This ensures that the remaining majority samples are closer to the minority class distribution, helping the model better capture decision boundaries [7].
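A minimal sketch of both downsampling techniques using the imbalanced-learn package (one possible implementation; the toy dataset is illustrative):

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, NearMiss

# A toy two-class dataset with a 90% / 10% class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Randomly remove majority-class samples until the classes are balanced.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

# Keep the majority samples closest to the minority class (NearMiss version 1).
X_nm, y_nm = NearMiss(version=1).fit_resample(X, y)
```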
References
[1] A. Jonker and A. Gomstyn, “What are the key differences between structured and unstructured data?,” International Business Machines (IBM), accessed: 2025, https://www.ibm.com/think/topics/structured-vs-unstructured-data.
[2] GeeksforGeeks, “ML | Handling missing values,” GeeksforGeeks, accessed: July 21, 2025, https://www.geeksforgeeks.org/machine-learning/ml-handling-missing-values/.
[3] Google, “ML concepts - crash course,” Google, accessed: 2025, https://developers.google.com/machine-learning/crash-course/prereqs-and-prework.
[4] Z. Wang et al., “A comprehensive survey on data augmentation,” 2025, https://arxiv.org/abs/2405.09591.
[5] GeeksforGeeks, “Introduction to upsampling and downsampling imbalanced data in python,” GeeksforGeeks, accessed: July 23, 2025, https://www.geeksforgeeks.org/machine-learning/introduction-to-upsampling-and-downsampling-imbalanced-data-in-python/.
[6] J. Murel, “What is upsampling?,” International Business Machines (IBM), accessed: 2025, https://www.ibm.com/think/topics/upsampling.
[7] J. Murel, “What is downsampling?,” International Business Machines (IBM), accessed: 2025, https://www.ibm.com/think/topics/downsampling.
