My Index Hides Your Secret

When implementing any data migration or ingestion project, the most important things to have in place are audit functionality, chain of custody, error management and reporting. That’s the case whether data is being pulled from social feeds, such as Facebook or LinkedIn; financial feeds, like Bloomberg or Reuters; or archiving systems, like EMC EmailXtender or Autonomy EAS. It's interesting how often customers and partners focus on the features a data migration solution offers, but what about the data itself? How do you know you are actually getting all of the data? What makes up the actual data you are trying to migrate?

The last question is probably the most interesting. At a high level, you migrate the data from system or service A to B. Digging deeper, most systems have a database as well as the actual messages or other data on disk. So what seems like just data is actually two sources:

1. A database, usually containing custodian information that is critical for compliance

2. The messages, possibly in parts, on disk

Without both, the data is incomplete and probably not compliant (although that can depend on the source system).

However, some source systems and services go further than this, storing data in a third component: the index. This might be surprising, but storing data in an index isn't actually that uncommon, particularly for compliance data and the recipients or senders of a given message. That might include a point-in-time recipient set of a distribution group, or it might include specialist recipients or custodians that alter the compliance history in some form.

Data in an index usually solves one of two problems:

1. High-speed lookup access to data. Knowing the keywords in the body, subject and attachments, as well as other indexable criteria such as dates and size, allows users to run a very quick search and hit the right data in (usually) a few seconds or less.

2. Retaining and storing data. This problem can be slightly more obscure. Whereas some indexes merely copy key data points from the actual data on disk, other systems regard the index as a storage location, keeping the only copy of key parts of the data. This has significant implications for rebuilding indexes and is why many systems do not allow a full rebuild.

For migration or ingestion purposes, however, the main consideration is how this data actually comes out. One of the few ways to obtain this data is to ask the archiving system for it directly, which is where migrations and ingestions get interesting. Our research teams did a lot of testing and verification to understand how much data is actually contained in just the index, and the results were surprising, especially when cross-referenced with a system’s own database! It very quickly became apparent that running SQL queries against a database does not always produce the most compliant answer.
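To make the idea of cross-referencing concrete, here is a minimal sketch in Python. It is an illustration only, not the tooling described in this post. It assumes the recipient sets for each message have already been exported from the database and retrieved from the index through the archiving system's API; the dictionaries, the `RecipientDiff` type and the `cross_reference` function are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecipientDiff:
    """Recipients that appear in only one of the two sources (hypothetical type)."""
    message_id: str
    only_in_index: frozenset
    only_in_database: frozenset


def normalize(addresses):
    """Lower-case and trim addresses so cosmetic differences do not count."""
    return frozenset(a.strip().lower() for a in addresses if a.strip())


def cross_reference(db_recipients, index_recipients):
    """Compare per-message recipient sets from the database and the index.

    Both arguments are assumed to be dicts of message ID -> iterable of
    addresses. Returns one RecipientDiff per message whose sets disagree.
    """
    diffs = []
    for message_id in sorted(set(db_recipients) | set(index_recipients)):
        from_db = normalize(db_recipients.get(message_id, ()))
        from_index = normalize(index_recipients.get(message_id, ()))
        if from_db != from_index:
            diffs.append(
                RecipientDiff(message_id, from_index - from_db, from_db - from_index)
            )
    return diffs


# Made-up sample data: the index holds one recipient the database does not.
db = {
    "msg-1": ["alice@example.com", "sales-dl@example.com"],
    "msg-2": ["bob@example.com"],
}
idx = {
    "msg-1": ["Alice@example.com", "sales-dl@example.com", "carol@example.com"],
    "msg-2": ["bob@example.com"],
}
for diff in cross_reference(db, idx):
    print(diff.message_id, sorted(diff.only_in_index), sorted(diff.only_in_database))
# prints: msg-1 ['carol@example.com'] []
```

In this made-up sample, a SQL query against the database alone would never surface `carol@example.com` as a recipient of `msg-1`, which is exactly the kind of gap that matters for compliance.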

So, how does this data get extracted from the index? Well, it is not feasible to read an index directly on disk, given the level of risk in interpreting the proprietary format. Therefore, going through the archiving system's APIs is the only option. Knowing which APIs to use and how to harness those APIs correctly is something that we invest in for all of our products. Plus, we add another level of deep verification, double-checking once the data has been reconstructed and ingested into the target. By considering all three sources for data (indexes, data on disk and databases), we ensure that migrations and ingestions using our products are 100 percent compliant, without a doubt!
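One way to picture that kind of post-ingestion check is sketched below in Python. This is an assumption-laden illustration rather than how any particular product works: the field names (`message_id`, `recipients`, `body_sha256`) and the functions `merge_sources`, `fingerprint` and `verify` are hypothetical. The idea is to reconstruct a record from all three sources, compute a stable digest of it, and compare that digest with one computed from the record read back from the target.

```python
import hashlib
import json


def merge_sources(db_row, disk_parts, index_fields):
    """Combine database, disk and index data into one record (hypothetical fields)."""
    recipients = set(db_row.get("recipients", [])) | set(index_fields.get("recipients", []))
    return {
        "message_id": db_row["message_id"],
        "recipients": sorted(recipients),
        # Message parts on disk are joined in order and hashed as one body.
        "body_sha256": hashlib.sha256(b"".join(disk_parts)).hexdigest(),
    }


def fingerprint(record):
    """Stable digest of a record: canonical JSON, then SHA-256."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(source_record, target_record):
    """Return True when the ingested record matches the reconstructed one."""
    return fingerprint(source_record) == fingerprint(target_record)


# Made-up sample data: one recipient exists only in the index.
source = merge_sources(
    {"message_id": "msg-1", "recipients": ["alice@example.com"]},
    [b"part one, ", b"part two"],
    {"recipients": ["carol@example.com"]},
)
target_complete = dict(source)
target_missing_index_data = dict(source, recipients=["alice@example.com"])

print(verify(source, target_complete))            # True
print(verify(source, target_missing_index_data))  # False
```

The second check fails because the target lost the recipient that existed only in the index, so a migration that ignored the index would be caught at verification time.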