To me, one of the biggest pieces of news delivered during the conference was the new generation of Documentum full-text indexing, called the Enterprise Search Server (ESS). This is the first official message that EMC Documentum will move away from the OEM version of FAST ESP, which has been in use since Documentum 5.3 (2005). The inclusion of FAST back then gave Documentum a solution where metadata from the relational database was merged with text from the content file into an XML file (FTXML) that could be queried using DQL. Before diving into the features of the new technology, I guess everyone wonders about the reason for this decision. The main reasons are said to be:
- Performance. One FAST full-text node supports up to around 20 million objects in the repository (some customers commented that their experience was closer to 10 M…) and it requires in-memory indices. With Documentum installations containing billions of objects, that means 100+ nodes, and that has been a hard sell in terms of hardware requirements.
- Virtualisation. Apparently, talks with Microsoft/FAST about the requirement to support all Documentum products on VMware made no progress. This has been a customer demand for some time. MS/FAST cites intensive I/O demands as the reason why they were not interested in certifying the full-text index for virtualisation.
- More flexible High Availability (HA) options. Today FAST can be clustered by adding new nodes, which leads to a requirement of having the same number of nodes again for backup/high availability.
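To make the hardware argument concrete, here is a rough back-of-the-envelope sketch in Python. The repository size of 2 billion objects is my own illustrative assumption, the function names are made up, and "n_plus_1" refers to the N+1 option listed for ESS further down; only the 20 million and 10 million objects-per-node figures come from the session.

```python
def nodes_needed(total_objects: int, objects_per_node: int) -> int:
    """Smallest number of index nodes that can hold total_objects."""
    if total_objects < 0 or objects_per_node <= 0:
        raise ValueError("total_objects must be >= 0 and objects_per_node > 0")
    # Integer ceiling division, which avoids floating-point rounding.
    return -(-total_objects // objects_per_node)


def nodes_with_ha(primary_nodes: int, scheme: str) -> int:
    """Total node count for a given (hypothetical) HA scheme name."""
    if scheme == "mirrored":
        # Every primary node gets its own backup node.
        return 2 * primary_nodes
    if scheme == "n_plus_1":
        # One spare node covers the failure of any single primary node.
        return primary_nodes + 1
    raise ValueError(f"unknown HA scheme: {scheme}")


if __name__ == "__main__":
    total = 2_000_000_000  # assumed repository size, for illustration only
    for per_node in (20_000_000, 10_000_000):
        primary = nodes_needed(total, per_node)
        print(
            f"{per_node:>11,} objects/node: {primary} primary nodes, "
            f"{nodes_with_ha(primary, 'mirrored')} mirrored, "
            f"{nodes_with_ha(primary, 'n_plus_1')} with N+1"
        )
```

Under these assumptions, mirroring doubles whatever the primary node count is, while an N+1 setup adds a single extra node regardless of cluster size.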
From a performance standpoint, I personally think that the current implementation of FAST leads to a slow end-user experience when searching in Documentum. One reason is that a search is first sent to FAST, which delivers a result set irrespective of my permissions. The whole result set must then be filtered by querying the relational database, and that takes time. This is also a reason why we have integrated an external search engine based on the more modern FAST ESP 5.x server with the Security Access Module, which means that ACLs are indexed and filtering can be done in one step when searching in the external FAST Search Front-end (SFE). More about how that is solved in ESS later on.
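The difference between the two filtering approaches can be sketched in a few lines of Python. This is only a toy model, not how FAST or Documentum actually implement it: the `Doc` class, the `ACL_DB` dictionary (standing in for the relational database) and both function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups that may read the document


DOCS = [
    Doc("d1", "quarterly budget report", frozenset({"finance"})),
    Doc("d2", "budget planning notes", frozenset({"finance", "managers"})),
    Doc("d3", "team budget summary", frozenset({"engineering"})),
    Doc("d4", "holiday schedule", frozenset({"everyone"})),
]

# Stand-in for the relational database that holds the permissions.
ACL_DB = {d.doc_id: d.allowed_groups for d in DOCS}


def matches(doc, term):
    """Very naive full-text match: exact word match, case-insensitive."""
    return term.lower() in doc.text.lower().split()


def search_then_filter(docs, acl_db, term, user_groups):
    """Two steps: the index ignores permissions, every hit is checked afterwards."""
    hits = [d.doc_id for d in docs if matches(d, term)]
    permitted, lookups = [], 0
    for doc_id in hits:
        lookups += 1  # stands in for one permission check against the database
        if acl_db[doc_id] & user_groups:
            permitted.append(doc_id)
    return permitted, lookups


def search_with_indexed_acl(docs, term, user_groups):
    """One step: the ACL is stored with the index entry and checked while searching."""
    return [
        d.doc_id
        for d in docs
        if matches(d, term) and d.allowed_groups & user_groups
    ]


if __name__ == "__main__":
    me = {"engineering", "everyone"}
    print(search_then_filter(DOCS, ACL_DB, "budget", me))
    print(search_with_indexed_acl(DOCS, "budget", me))
```

Both functions return the same permitted documents. The point of the sketch is the `lookups` counter: in the two-step variant the number of permission checks grows with the size of the unfiltered result set, not with what the user is actually allowed to see.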
From a business perspective, EMC outlines these challenges that it sees a need to address:
- End users expect Google/Yahoo search paradigms
- IT managers want low cost, scalability, ease of deployment and easy administration.
- Requirements for large-scale, distributed deployments with multilingual support.
- Enterprise requirements such as low-cost HA, backup/restore and SAN/NAS support.
The new ESS is based on the xDB technology that came with the acquisition of the company X-Hive, and it leverages the open source full-text indexing technology from the Lucene project. The goal for ESS is to leverage the existing open indexing architecture in Documentum. The idea is to create a solution that really scales, but of course with some trade-offs between space and query performance.
ESS supports structured and unstructured search by leveraging best-of-breed XML database and XQuery standards. It is designed with enterprise readiness, scalability, ingestion throughput and high quality of search as core features. It also provides the Advanced Data Management functionality (control over where data is placed on disk) necessary for large-scale systems. The intention is to let EMC continue to develop and provide the new search features and functionality required by its customer base.
It is architected for greater scalability and a smaller footprint than the current full-text search, and it scales both horizontally (more nodes) and vertically (more servers on the same node). It is designed to support tens to hundreds of millions of objects per node.
This allows for solutions such as archiving, where there can be a billion or more emails/documents, achieving scale while preserving high quality of search. The query response time can be throttled up or down based on needs: priority can be shifted between indexing and querying.
The installation procedure is also simplified, and EMC promises that a two-node deployment can be up and running in less than 20 minutes. The solution is also designed to make it easy to add new nodes to an installation.
ESS is much more than a simple replacement of the full-text engine. It will focus on delivering these additional features compared to existing solutions:
- Low cost HA (n+1 Server based)
- Disaster Recovery
- Data Management
- VMware Support
- NAS Support
- New Administration Framework
The new admin features include a new ESS Admin interface which has a look and feel very similar to CenterStage. Since the intention is to support ESS on non-Documentum installations, it is a separate web client. The framework also supports Web Services, a Java API and JMX, and it is open for administration using OpenView, Tivoli, MMC etc.
The server consists of:
- ESS API
- Indexing Services will have document batching capability, callback support for searchable indication and a Content Processing Pipeline with text extraction and linguistic analysis via CPS.
- Search Services. This will provide search for metadata, content or both (XQuery based) as well as multiple search options such as batching, spooling, filters, language, analyser etc. It will return results in an XML format and provide term highlighting, summaries and relevancy. The thread execution management supports multi-query and parallel query. It also includes low-level security filtering.
- Content Processing Services is responsible for language detection, text extraction and linguistic analysis. The CPS can be local or remote (co-located with content for improved performance). It will have a pluggable architecture to support various analysers and/or text extractors. It will include out of the box support for Basis RLP and Apache SnowBall analysers. However only one analyser can be configured per ESS. (My question: Can I have different analysers on different nodes?). Content Processing can be extended by plugins.
- Node and Data Management Services is the primary interface for all data and node management within ESS. It provides the ability to control the routing of documents and the placement of collections and indices on disk. It deals with index management and supports bind, detach, attach, merge, freeze, read-only etc.
- Analytics includes APIs and a data model for logging, metrics and auditing, ingestion and search analysis, and facet computation services.
- Admin Services. The example shown was really powerful: an admin could view all searches made by a user by time and see how long each took to return the first result set. The ones that took longer could be explored by viewing the query to analyse why it took so long.
Below that sits xDB, and at the bottom the Lucene indices. The whole solution is 100% Java. xDB stores XML documents in a persistent DOM format and supports XQuery and XPath. Indices consist of a combination of native B-tree indices and Lucene. xDB supports single- and multi-node architectures and has support for multi-statement transactions and full ACID support. In addition it supports XQFT (see an introduction to it here), which is a proposed standard extension to XQuery that includes:
- LQL via a full-text extension
- Logical full-text operator
- Wildcard option
- Anyall options
- Positional filters
- Score variables
ESS includes native security, which means that security is replicated into the search server and security filtering is done at a low level in the xDB database. This means efficient searches on large result sets and enables facet computation on entire result sets.
Native facet computation is a key feature in ESS which is of course linked to the new search interface in CenterStage which is based on facets in an iTunes-like interface. Facets are of course nothing new but it is good that EMC has finally realised that it is a powerful but still easy way to give users “advanced search”.
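A rough Python sketch of why security filtering and facets belong together: if the permission check happens first and cheaply, facet counts can be computed over everything the user is allowed to see instead of over a truncated first page. The sample records, field names and function names are invented for illustration and say nothing about how ESS or xDB actually store or count anything.

```python
from collections import Counter

# Invented sample of search hits, each carrying its own read permissions.
RESULTS = [
    {"id": "d1", "format": "pdf", "author": "anna", "groups": {"finance"}},
    {"id": "d2", "format": "ppt", "author": "anna", "groups": {"everyone"}},
    {"id": "d3", "format": "pdf", "author": "bo", "groups": {"everyone"}},
    {"id": "d4", "format": "doc", "author": "bo", "groups": {"engineering"}},
    {"id": "d5", "format": "pdf", "author": "cy", "groups": {"engineering", "finance"}},
]


def visible(results, user_groups):
    """Keep only the hits the user may read (security filter applied first)."""
    return [r for r in results if r["groups"] & user_groups]


def facet_counts(results, field):
    """Count how many hits fall under each value of the given field."""
    return Counter(r[field] for r in results if field in r)


if __name__ == "__main__":
    mine = visible(RESULTS, {"engineering", "everyone"})
    print(facet_counts(mine, "format"))
    print(facet_counts(mine, "author"))
```

Because the counts are taken after filtering, a user never sees a facet value that exists only in documents they cannot open, which is what makes facets safe to expose as an easy form of "advanced search".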
ESS leverages a Distributed Content Architecture (for instance using BOCS) by sending only the raw text (DFTXML) over the network instead of the binary file, which can be much larger in many cases (such as big PowerPoint files). ESS also utilizes the new Content Processing Services (CPS) as well as ACS.
The new solution also makes it possible to do hot backups without first taking the index server down, as is required today. Backup and restore can be done on a sub-index level. The new options for High Availability include:
- Active/active shared data (the only one available for FAST)
- Active/passive with clusters
- N+1 Server based
Things I would like to see but have not heard about yet:
- Word frequency analysis (word clouds based on document content)
- Clustering and categorisation (maybe done by Content Intelligence Services)
- Synonym management
- Query-expansion management
- How document similarity is handled by vector-space search (I guess done by Lucene?)
- Boosting & Blocking of specific content connected to a query
- Multiple search-views (different settings for synonyms, boost&blocking etc)
- Visualisation of entity extraction and other annotations
- Functionality or at least an API to manually edit entity extraction within the index. Semi-automatic solutions are the best.
- Freshness management.
- Speech-to-text integration (maybe from Audio/Video Transformation Services)
Personally I think this is a much needed move to really improve the internal search in Documentum and make much better use of the underlying information infrastructure in Documentum. It will be interesting to see what effect this has on Microsoft/FAST's ambitions to support the Documentum connector. Maybe the remaining resources (no OEM version to develop) can focus on bringing the connector from an old 5.3 API to a modern 6.5 API. I still see a need for utilising multiple search engines, but as ESS gains more advanced features the rationale for an expensive external solution can change. The beta for Content Intelligence Studio will be one important step in outlining the overall enterprise search architecture for big ECM solutions. Part of this is of course tracking what Autonomy brings to market in the near future.
Another thing worth mentioning is that during the past four conferences I have heard quite a few complaints about the stability of the current FAST-based full-text index. It crashes/stops regularly, and often without anybody knowing about it before users start complaining about strange search results.
A public beta will be released in Q3 2009 and customers are invited to participate. Participants will receive a piece of hardware with ESS pre-installed and pre-configured, and after a few configuration changes in Content Server it should be up and running.
Customers will have the option of upgrading the existing FAST full-text index or running the new ESS side-by-side with FAST. EMC will also market ESS for non-Documentum solutions.