Know-How
This section provides the information you need to understand and complete Module 3 - Advanced Data Massaging, which involves advanced data transformations using LOGIBLOX tools.
They are specifically designed to make your tasks as convenient and comprehensible as possible.
Guide to dataset massaging / cleansing
Imagine you want to better understand a company's sales position and potential, for both its products and its customers. While this looks like a straightforward task, it can be challenging to locate and relate information accurately if the database is not formatted the way you anticipated.
Even worse, your database may include redundant fields and missing/corrupt data.
So, let's tackle it and make it much simpler and more automated with LOGIBLOX.
What exactly is data massaging? And, how do you massage data?
What is Data Massaging?
Data massaging, also known as data cleansing or data scrubbing, is the process of removing unnecessary information from a dataset and cleaning it to make it usable.
It involves processing data to change data formats, remove unwanted characters, duplicates, whitespace, and more. Simply put, data massaging is the ‘transformation’ step in the ETL process.
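As a rough illustration of two of the operations just mentioned (removing whitespace and removing duplicates), here is a minimal sketch in plain Python. It does not use any LOGIBLOX API; the function name `strip_and_dedupe` and the sample rows are hypothetical and exist only for this example.

```python
def strip_and_dedupe(rows):
    """Trim whitespace from every field, then drop duplicate rows.

    The first occurrence of each row is kept, and input order is preserved.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Remove leading and trailing whitespace from each field.
        trimmed = tuple(field.strip() for field in row)
        # Rows that become identical after trimming count as duplicates.
        if trimmed not in seen:
            seen.add(trimmed)
            cleaned.append(trimmed)
    return cleaned


# Hypothetical sample data: (customer, product)
raw_rows = [
    ("  Acme Corp ", "Widget"),
    ("Acme Corp", "Widget "),
    ("Globex", "Gadget"),
]

print(strip_and_dedupe(raw_rows))
```

Note that the trimming happens before the duplicate check; otherwise the first two rows, which differ only in whitespace, would both survive.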
Applying Massaging Techniques on Data
Some common data massaging techniques that convert data into a usable form include:
- Changing the format of the source data to make it compatible with the target system (for example, changing date format from dd/mm/yyyy to mm/dd/yyyy).
- Replacing missing values with defaults (for example, entering ‘0’ whenever a quantity is not given).
- Filtering out data that is not desired in the destination system.
- Checking the validity of data and fixing records that can generate errors (for instance, removing special characters like * ^ & that make data invalid).
- Standardizing data to get rid of variations (for example, replacing upper case with lower case or replacing ‘01’ with ‘1’).
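The five techniques above can be sketched on a single record as follows. This is an illustrative example in plain Python under stated assumptions, not LOGIBLOX code: the field names (`order_date`, `quantity`, `product`, `region_code`, `status`) and the rule that records with status "test" are unwanted in the destination are all hypothetical.

```python
import re
from datetime import datetime


def reformat_date(value):
    """Convert a date string from dd/mm/yyyy to mm/dd/yyyy."""
    return datetime.strptime(value, "%d/%m/%Y").strftime("%m/%d/%Y")


def massage_record(record):
    """Return a cleaned copy of the record, or None if it is filtered out."""
    # Filtering: drop records not wanted in the destination system
    # (hypothetical rule: anything marked as a test record).
    if record.get("status") == "test":
        return None

    cleaned = {}

    # Format change: dd/mm/yyyy -> mm/dd/yyyy.
    cleaned["order_date"] = reformat_date(record["order_date"])

    # Missing values: default the quantity to 0 when it is not given.
    quantity = record.get("quantity")
    cleaned["quantity"] = int(quantity) if quantity not in (None, "") else 0

    # Validity: remove the special characters * ^ & from the product name.
    product = re.sub(r"[*^&]", "", record["product"])

    # Standardizing: lower-case text and turn codes such as '01' into '1'.
    cleaned["product"] = product.lower()
    cleaned["region_code"] = record["region_code"].lstrip("0") or "0"

    return cleaned


# Hypothetical input record
raw = {
    "order_date": "25/12/2023",
    "quantity": "",
    "product": "Wid*get^&",
    "region_code": "01",
    "status": "live",
}

print(massage_record(raw))
```

Keeping each rule as a separate, commented step makes it easier to reorder or drop rules when the target system has different requirements.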
Why Is It Important to Massage Data?
According to IBM, 80% of a data scientist’s time is spent preparing, cleaning, and organizing data, leaving only 20% of their time to analyze it.
This is because enterprises usually generate a huge volume of data from different sources, which can have imperfections due to redundancies or inconsistencies. To make this data usable for analysis, it has to be cleaned, formatted, and standardized; otherwise, the results will be skewed.
This is where data massaging comes into play.
By transforming, cleaning, integrating, and normalizing data, you can improve the accuracy of your data and, in turn, the quality of your decision-making.

