Dedupe Portal

A web app that provides a rich interface for deduplicating people data, using active learning techniques on the backend.

Collaborators: Anshul Goyal, Vinay Gedam.

Background

Data deduplication – often called intelligent compression or single-instance storage – is a process that eliminates redundant copies of data and reduces storage overhead. Deduplication techniques ensure that only one unique instance of the data is retained on storage media such as a database, disk, flash, or tape. Redundant data blocks are replaced with a pointer to the unique copy.

Objectives

In this project, we aim to deduplicate people data in order to consolidate user profiles collected from multiple platforms. We identify and merge all duplicates to create one enriched copy and delete the redundant copies. The project uses the dedupe library for this and provides a web interface over it that takes a CSV file as input, performs deduplication using active learning, and finally returns the deduplicated CSV file for download.
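The merge step described above can be illustrated with a minimal sketch. It is not the project's actual code: it assumes the duplicate clusters have already been identified (in the portal that would be the job of the dedupe library, which is not called here), and the function names `merge_cluster`, `consolidate`, and `dedupe_csv`, as well as the "keep the longest non-empty value per field" rule, are hypothetical choices made only for illustration.

```python
import csv
import io


def merge_cluster(records):
    """Merge duplicate records into one enriched record.

    Assumed rule (hypothetical): for each field, keep the longest
    non-empty value seen across the duplicates.
    """
    merged = {}
    for record in records:
        for field, value in record.items():
            value = (value or "").strip()
            # setdefault makes sure every field appears in the output,
            # even if all duplicates leave it empty.
            if len(value) > len(merged.setdefault(field, "")):
                merged[field] = value
    return merged


def consolidate(rows, cluster_ids):
    """Group rows by cluster id and return one merged row per cluster.

    Clusters are emitted in the order in which they first appear.
    """
    clusters = {}
    for row, cluster_id in zip(rows, cluster_ids):
        clusters.setdefault(cluster_id, []).append(row)
    return [merge_cluster(group) for group in clusters.values()]


def dedupe_csv(text, cluster_ids):
    """Read CSV text, merge rows that share a cluster id, return CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(consolidate(rows, cluster_ids))
    return out.getvalue()


# Made-up sample data: the first two rows are assumed to be the same person.
sample = (
    "name,email,phone\n"
    "Jon Smith,jon@example.com,\n"
    "Jonathan Smith,,555-0100\n"
    "Ana Lee,ana@example.com,555-0199\n"
)
result = dedupe_csv(sample, cluster_ids=[0, 0, 1])
print(result)
```

In this sketch the cluster ids are passed in by hand; in a real pipeline they would come from whatever matching step precedes the merge, and the merge rule would likely need to be tuned per field.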

Products

  • Web App