I have always just used the Data Cleansing tool to fill in null values with zeroes. I have never really used the imputation tool and am not sure when or why it would be used, and what advantages it present.
Could someone point out some use cases and functionality of the imputation tool that merits its existence?
When it comes to Nulls, the Data Cleansing Tool can only do one of two things:
- Replace with a blank if it is a string field
- Replace with a 0 if it is a numeric field
On top of that, the Data Cleansing tool can remove unwanted characters are modify the case of a string. If I am just wanting to replace null values with either a zero or a blank, then I will always use the Data Cleansing Tool. Often it won’t hurt to make use of the tools other functionalities at the same time.
The Imputation Tool is useful if you are wanting to replace Null values (or any user-specified value) with either:
- The average of the field
- The median of the field
- The mode of the field
- A user-specified value
Note that this tool can only be used with numeric fields. It is useful to target and change specific incorrect values or replace null data (which could be missing inputs) with an average of a field so that analysis can be done with more data.
It is also very handy that it includes the options to output new columns with all of the imputed values if you would like to do comparative analysis on the data sets with or without imputed data. Finally, the Imputation Tool allows you to output the imputed columns with a flag of zero (no imputation on that cell) or one (cell was imputed) if you would like to keep track of how many cells were imputed after the fact.