Data Deduplication is an algorithm developed by OrderStack that has helped many companies remove duplicate records from their master tables.
- Makes master tables cleaner by removing redundant and duplicate rows.
- Keeps only the required data in the master tables.
- The algorithm is automated and works on tables with large amounts of data.
- The algorithm can remove up to 30% of the duplicates in a table, with roughly 1-2% false positives.
As part of our process, we first require a sample of the master table to fine-tune our thresholds for the deduplication process. Once the sample table yields adequate results, the algorithm is run on the main master table.
The data is analysed thoroughly and the thresholds are set accordingly before the data is run through the algorithm.
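To make the threshold idea concrete, the sketch below shows one way a similarity threshold could be swept over a small labelled sample; the fuzzy string matching, field names, and threshold values here are illustrative assumptions, not OrderStack's actual implementation.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_duplicate(row_a: dict, row_b: dict, threshold: float) -> bool:
    """Treat two rows as duplicates when their name fields are similar enough."""
    return similarity(row_a["name"], row_b["name"]) >= threshold

# Sweep a few candidate thresholds over a small labelled sample to see which
# one best balances duplicates caught against false positives.
sample_pairs = [
    ({"name": "Globex Ltd"}, {"name": "Globex Ltd."}, True),
    ({"name": "Globex Ltd"}, {"name": "Initech Ltd"}, False),
]
for threshold in (0.75, 0.85, 0.95):
    correct = sum(is_duplicate(a, b, threshold) == label for a, b, label in sample_pairs)
    print(f"threshold={threshold}: {correct}/{len(sample_pairs)} pairs classified correctly")
```

In practice the threshold chosen on the sample is the one carried forward to the full master table.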
Based on our plan, we structured the development as follows.
- The original data is not mutated at all.
- The data is cleaned appropriately, making the deduplication process faster and more efficient.
- After cleaning, the data is run through the algorithm, where a unique ID is assigned to each duplicate row.
- Once each row is marked as unique or duplicate, the data is merged, removing the duplicate rows.
- The final output contains only the required and appropriate data, resulting in a cleaner, more usable master table (a simplified sketch of this pipeline follows below).
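The sketch below illustrates the clean, mark, and merge steps on a toy table using pandas; the column names, normalization rules, and grouping key are assumptions for illustration and do not reflect the real master-table schema or the production algorithm.

```python
import pandas as pd

def deduplicate(master: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the original master data is never mutated.
    cleaned = master.copy()

    # Cleaning: normalize case and whitespace so near-identical rows line up.
    for col in ("name", "city"):
        cleaned[col] = cleaned[col].str.strip().str.lower()

    # Marking: rows that agree on the cleaned key columns share a duplicate-group ID.
    cleaned["dup_id"] = cleaned.groupby(["name", "city"]).ngroup()

    # Merging: keep one row per duplicate group, dropping the redundant copies.
    merged = cleaned.drop_duplicates(subset="dup_id").drop(columns="dup_id")
    return merged.reset_index(drop=True)

master = pd.DataFrame({
    "name": ["Acme Corp", "ACME CORP ", "Initech"],
    "city": ["Pune", "pune", "Mumbai"],
})
print(deduplicate(master))   # two rows remain; the original `master` is unchanged
```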
A pilot is run first on a sample of the master table. This takes a day or two, depending on the complexity of the data.
After the pilot, the algorithm is run on the full master data. The final output is cleaned and normalized without mutating the original data. Producing the final output may take 1-2 weeks, depending on the complexity and volume of the data.
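As a rough illustration of the pilot-then-full-run workflow (not the actual engagement tooling), the snippet below reuses the deduplicate() helper from the previous sketch; the file paths and sample size are placeholders.

```python
import pandas as pd

# Pilot: validate the approach on a small sample of the master table first.
master = pd.read_csv("master_table.csv")            # path is a placeholder
pilot = master.sample(n=1000, random_state=42)      # sample size is illustrative
pilot_deduped = deduplicate(pilot)                  # deduplicate() from the sketch above
print(f"pilot: {len(pilot)} -> {len(pilot_deduped)} rows")

# Full run: once the pilot results look right, process the entire table and
# write the cleaned output to a new file, leaving the original untouched.
full_deduped = deduplicate(master)
full_deduped.to_csv("master_table_deduplicated.csv", index=False)
```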