I just read a blog post by Chuck Hollis at EMC about data deduplication and was reminded how important it is to an EMC platform. To me, however, it is not just a way to improve or optimize storage; it is a critical part of any information architecture.
Data deduplication should be a core feature in platforms such as EMC Documentum, which bring the use of this feature up to the business side of things. Everything stored in Documentum is an object, which may or may not have an attachment in the form of a document, and that object can be exposed to users in one or many folders. The key thing is that these linkages open up interesting ways of using data deduplication.
Imagine a corporate environment with thousands of users. A lot of important documents in the company will be used many times by many people in different contexts. Since many of them are likely used as references in different projects (corporate strategy, marketing documents and so forth), they are essentially read-only. However, since a lot of users need these exact documents, they will be imported into the repository many times, not only taking up unnecessary space but also creating problems when the documents are updated. People will ask, “Hey, are all of these documents I found the same version?”
So my solution would be a job that runs on import, tells the user that this particular document is already available, and asks whether they want to use the existing one instead. If they do, a link to that document is created in their folder or project space.
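To make the import check concrete, here is a minimal sketch in Python. It is not Documentum code: the `Repository` class, its fields and `import_document` are hypothetical stand-ins for a real repository, and the user's answer to "use the existing one?" is modeled as a simple `use_existing` flag.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Repository:
    """Hypothetical in-memory stand-in for a content repository."""
    contents: dict = field(default_factory=dict)  # object id -> content bytes
    by_hash: dict = field(default_factory=dict)   # SHA-256 hex -> first object id with that content
    links: dict = field(default_factory=dict)     # object id -> set of folder paths
    next_id: int = 1

    def find_existing(self, data: bytes) -> Optional[int]:
        """Return the id of an object with identical content, if any."""
        return self.by_hash.get(hashlib.sha256(data).hexdigest())

    def link(self, object_id: int, folder: str) -> None:
        """Expose an existing object in one more folder."""
        self.links[object_id].add(folder)

    def store(self, data: bytes, folder: str) -> int:
        """Store a new copy of the content and link it into the folder."""
        object_id = self.next_id
        self.next_id += 1
        self.contents[object_id] = data
        self.by_hash.setdefault(hashlib.sha256(data).hexdigest(), object_id)
        self.links[object_id] = {folder}
        return object_id


def import_document(repo: Repository, data: bytes, folder: str,
                    use_existing: bool = True) -> int:
    """Import content; link to an identical existing object if the user agrees."""
    existing = repo.find_existing(data)
    if existing is not None and use_existing:
        repo.link(existing, folder)
        return existing
    return repo.store(data, folder)
```

Importing the same bytes into two project folders with `use_existing=True` leaves one stored object linked from both folders; passing `use_existing=False` models a user who insists on a separate copy.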
We can also continuously run a job that reports on the current status of the repository to see how many duplicates we have and what kind of content is duplicated the most. Documentum Reporting Services could be used for that, for instance. If we want a proactive Knowledge Management function, it can either consolidate the duplicates directly or create tasks for users asking whether they agree to deduplicate some of their content. However, we need to push hard for someone to create a really cool and usable interface for managing these “content conflicts”.
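A report like this could be sketched as follows. This is an illustration only and does not use Documentum Reporting Services; the `duplicate_report` function and the `(path, content_type, data)` input shape are assumptions made for the example.

```python
import hashlib
from collections import Counter, defaultdict


def duplicate_report(documents):
    """Summarize exact duplicates.

    documents: iterable of (path, content_type, data) tuples, data as bytes.
    """
    # Group documents by the SHA-256 digest of their content.
    groups = defaultdict(list)
    for path, content_type, data in documents:
        digest = hashlib.sha256(data).hexdigest()
        groups[digest].append((path, content_type, len(data)))

    redundant_by_type = Counter()
    wasted_bytes = 0
    duplicate_groups = []
    for members in groups.values():
        if len(members) < 2:
            continue  # content stored only once
        duplicate_groups.append(sorted(path for path, _, _ in members))
        # Every copy beyond the first one is counted as redundant.
        for _, content_type, size in members[1:]:
            redundant_by_type[content_type] += 1
            wasted_bytes += size

    return {
        "duplicate_groups": sorted(duplicate_groups),
        "redundant_copies": sum(redundant_by_type.values()),
        "wasted_bytes": wasted_bytes,
        "most_duplicated_types": redundant_by_type.most_common(),
    }
```

The `duplicate_groups` list is the kind of input a Knowledge Management function could turn into consolidation tasks, one task per group.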
This will help companies manage vital documents and reduce the confusion over which document is the correct and up-to-date one.
From a technology standpoint, the first step would be to use a simple hash function to find exact duplicates. The next step should be to use the vector-based indexing technology found in both Autonomy and FAST ESP to also detect similarity levels, and possibly use those to further refine how similar content is handled. That way we can deduplicate the same content found in different formats and have the option of removing one of the copies or maybe just making one a rendition of the other.
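The two steps can be sketched roughly like this. The similarity part is a plain term-frequency vector with cosine similarity, swapped in as a simple illustration of the vector idea; it is not how Autonomy or FAST ESP work internally. It also assumes the text has already been extracted from each format, and the 0.9 threshold and the function names are arbitrary choices for the example.

```python
import hashlib
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lower-cased word counts, ignoring punctuation and layout."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_pair(text_a: str, text_b: str, threshold: float = 0.9) -> str:
    """Step 1: exact match by hash. Step 2: similarity by term vectors."""
    hash_a = hashlib.sha256(text_a.encode("utf-8")).digest()
    hash_b = hashlib.sha256(text_b.encode("utf-8")).digest()
    if hash_a == hash_b:
        return "exact"
    score = cosine_similarity(term_vector(text_a), term_vector(text_b))
    return "similar" if score >= threshold else "different"
```

A pair classified as "similar" would be a candidate for the rendition option, for example the same report saved once as a word-processing file and once as a PDF, while "exact" pairs can be handled by the hash check alone.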