When implementing a CRM system for an established Enterprise organisation, there are many systems to gather data from, to be able to provide a comprehensive customer view to the end-users of the software.
The 360 degrees customer view is normally obtained by combining real-time integrations with external systems (such as accounting software, ERP, recommendation engines etc.) with an extensive process of data migration including data mapping, data scrubbing, manipulation and import from existing systems that will eventually be decommissioned.
Today I will focus specifically on the subject of performing a sizeable initial data load from a system to Sugar’s CRM system.
When approaching complex and large data migrations, usually the organisation’s database administrators get involved as well. With DBAs point of view being mostly data centric, the preferred approach is normally to understand Sugar’s DB schema, slice and dice the data, and finally migrate data from database to database.
The challenge with this approach is that Sugar is not just a data placeholder, but a complex web based application where the application layer can have run-time interactions with the data as well (eg: by creating related entities wherever required, by communicating data changes to downstream systems like Elasticsearch etc.). Based on my experience, I found that direct database queries normally add too much complexity and unnecessary overhead to the project, by trying to understand in full and possibly re-build the application layer, to complete the missing actions after the actual data import.
Custom bulk import upsert API
The next logical step is to look at the application’s provided RESTful API. When the amount of data starts to stretch import times with just the CRUD Sugar API, it is possible to enhance the rich Sugar framework with additional custom RESTful APIs that can help us with the import process’ timing.
Having implemented numerous initial data loads as well as bulk imports, the most common approaches to enhance the import data speed are the followings:
- Reduce the number of API calls required to achieve the final outcome
- Increase the number of records that can be processed at any one time with a single API call
- Decrease lookup time when dealing with existing records
- Decrease save time by removing or queueing in the background non-essential actions during import
- Execute data imports in parallel wherever possible, for independent entities
With the above points in mind, I went ahead and built a bulk upsert (update or insert based on the lookup unique key) tool for the SugarCRM Community to help with the import process and shared it on the SugarCRM Labs BulkImport git repository.
The API accepts multiple records per request and it is optimised for performance. Is it possible to configure:
- Modules available for import by the Admin users via the bulk upsert API
- SQL lookup statement to execute for each record, based on a known external unique identifier, to proceed either with an update or an insert
- External unique identifier location within Sugar
- Custom logic definition either before and/or after processing of each record
The example can cover most use cases either with Sugar being the master of the imported data going forward, or with an external system still being the data master.
To achieve the best throughput I recommend to disable all non business critical operations during initial data loads (eg: if not required on initial load disable all workflows, logic hooks and other synchronous actions).
The clear benefits brought by the custom API for extensive data import and initial data loads are summarised below:
- Flexibility, by leveraging in full the underlying framework
- Multiple records can be inserted with one REST call
- Performance is greatly improved by implementing various strategies (eg: upserts, multiple operations in one call, multiple records per call, completing lookup with sql queries only, disable activity streams for imported records etc.)
- The application layer’s logic is triggered automatically (eg: audit logs, logic hooks, workflows)
- Related and downstream systems are automatically made aware of data changing (eg: Elasticsearch)
- Accessible only to Admin users
- It provides a great starting point to build upon. It will work for most scenarios as-is with minor configurations
To conclude the post, I wanted to list few additional thoughts about what else needs to be taken care of, when completing complex data migration and imports:
- Start building the ETL import processes really early, as it is really crucial to get it right to not miss go-live deadlines. It might even help your architects find the best data structure for the new system by proceeding with agile iterations. Basically treat the data load as a separate functional stream that evolves while additional functionality is delivered during the project’s sprints
- When building the ETL process, make sure all entities are imported in the right order (start from the entities with no data dependencies and when dealing with entities with dependencies, make sure they are all met)
- The ETL process should be optimised to insert as many entities in parallel as possible, as long as they are unrelated to each other
- Define with the business if there are portions of data (as in full entities, or even just a subset of records) that can be imported after go-live, so that those entities can trickle slowly after go-live, decreasing the required downtime
- The overall process should be initially tested with a small working data set that will need to be produced for the specific purpose
- The process should be tested, tweaked and optimised multiple times, until the full data load duration is acceptable for both IT and the business
- During data load tests, make sure the system’s infrastructure is sized correctly to sustain both the user’s load and the data import load as well
- Take care of disabling all unnecessary logic that should not be triggered during data import such as new workflows, customer welcome emails, record’s assignment emails etc.
- Make sure the DBAs are involved to actively find potential database bottlenecks and optimise your lookup queries
- Successful data load processes require a reliable way to compare data input vs data output to make sure that only actual incomplete (as in broken) data is not imported, and all the remaining records match. This can easily be a database to database comparison tool that can flag differences
All the best with your next data migration