Existing Enterprise Data Warehouses (EDWs) are reaching capacity faster than ever. Data growth, now doubling roughly every two years, limits what can be added to the warehouse, and the increase in data volume has also lengthened processing times and maxed out load windows.
Long load times are hindering the ability to deliver critical business information in a timely fashion. Investing in additional EDW resources to increase capacity is an option, but offloading data and processing to Hadoop is more cost effective.
Cold Data (infrequently used, inactive, or dormant data) consumes storage in the EDW with little benefit to day-to-day reporting. With the growing business desire to store "just in case" data, the amount of Cold Data sitting in today's data warehouses continues to increase.
Offloading Cold Data to Hadoop leads to more meaningful business insights from more of your data while freeing up resources from existing systems.
- Hadoop compute and storage cost a fraction of legacy EDW resources.
- Migrating Cold Data returns storage to the EDW, allowing growth without additional resources.
- Raw data once restricted in the EDW due to CPU utilization and performance concerns becomes accessible.
Utilize the Hadoop cluster for process-intensive workloads
Enterprises whose load windows are hitting capacity can leverage Hadoop's cluster processing to offload some of these workloads. Long-running ETL processes can be ported to and executed in the cluster, resulting in faster processing at a lower cost than in the EDW.
Processing raw data in Hadoop presents companies with opportunities to utilize more data and data sources not traditionally stored in the EDW. Once the data is transformed, the results can be loaded back into the existing EDW for further consumption.
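The transform-in-Hadoop, load-back-to-EDW pattern described above follows the classic map/reduce model: emit a key per raw record, then aggregate per key so only the small result set returns to the warehouse. A minimal sketch in plain Python (the record fields are hypothetical; a real job would run via MapReduce, Hive, or Spark on the cluster):

```python
from collections import defaultdict

# Hypothetical raw clickstream records that would normally live in HDFS,
# too voluminous to load into the EDW directly.
raw_events = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/home"},
    {"user": "u1", "page": "/cart"},
]

# Map phase: emit a (key, 1) pair per record.
mapped = [(event["page"], 1) for event in raw_events]

# Reduce phase: sum the counts per key.
page_views = defaultdict(int)
for page, count in mapped:
    page_views[page] += count

# The aggregated result is small enough to load back into the EDW.
print(dict(page_views))  # {'/home': 2, '/cart': 1}
```

The point of the pattern is that the heavy lifting happens over the raw data in the cluster, and only the compact aggregate crosses back into the EDW.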
Offloading to Hadoop is not a trivial exercise. Strategically assess, design, and then plan how to effectively deploy Hadoop capabilities into the solution architecture.
Data temperature. A holistic review of the EDW should be done to determine which objects are most used and which are the least used. This exercise can range from reviewing last access times found in system logs to performing custom analytics and surveys of users.
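The review above can be approximated with a simple classification over last-access data. A sketch under stated assumptions (the table names, dates, and 90/365-day thresholds are illustrative, not a standard; real inputs would come from system logs or query history):

```python
from datetime import date

# Hypothetical last-access dates pulled from system logs or a user survey.
last_access = {
    "daily_sales": date(2015, 6, 1),
    "orders_2009": date(2011, 3, 14),
    "customer_dim": date(2015, 5, 20),
}

def temperature(last_used, today, warm_days=90, cold_days=365):
    """Classify a table as hot, warm, or cold by days since last access."""
    age = (today - last_used).days
    if age <= warm_days:
        return "hot"
    if age <= cold_days:
        return "warm"
    return "cold"

today = date(2015, 6, 15)
offload_candidates = [table for table, last in last_access.items()
                      if temperature(last, today) == "cold"]
print(offload_candidates)  # ['orders_2009']
```

In practice the thresholds would be tuned per organization, and access statistics would be combined with user surveys before any table is declared cold.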
Load processes. When determining candidate ETL processes to offload to Hadoop, run time should not be the only factor; the data each process requires must also be accounted for. Processes that use raw source data are ideal, while processes that depend on other EDW transformations may be poorer candidates due to upstream ETL dependencies. Using EDW data, however, does not exclude a process from being offloaded: it may still be faster to export the data from the EDW to Hadoop, use the cluster's processing power, and then reload the results into the warehouse.
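The candidate screening described above can be sketched as a dependency check over process metadata. A minimal illustration (the process names and the `raw.`/`edw.` naming convention are hypothetical assumptions, not a real catalog):

```python
# Hypothetical ETL process metadata: each process lists its inputs.
# Inputs prefixed "edw." depend on prior warehouse transformations;
# inputs prefixed "raw." come straight from source systems.
processes = {
    "load_clickstream": ["raw.web_logs"],
    "build_customer_360": ["edw.customer_dim", "raw.crm_extract"],
    "parse_sensor_feed": ["raw.sensor_files"],
}

def is_ideal_candidate(inputs):
    """Ideal offload candidates read only raw source data,
    with no upstream EDW dependencies."""
    return all(not src.startswith("edw.") for src in inputs)

ideal = sorted(name for name, inputs in processes.items()
               if is_ideal_candidate(inputs))
print(ideal)  # ['load_clickstream', 'parse_sensor_feed']
```

Processes filtered out here, such as the one reading `edw.customer_dim`, are not excluded outright; they simply require an extra export-and-reload step whose cost must be weighed against the cluster's speedup.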
Planning and implementation. Offloading to Hadoop should be an incremental process: candidate objects should be offloaded in iterative implementations. Porting existing ETL to Hadoop can be a significant development effort; vendors such as Informatica provide tools to help port traditional ETL mappings to ones that run on a Hadoop cluster.
Hadoop is a key component of the next-generation data architecture.
Hadoop can relieve current overloaded EDWs as well as provide next-generation analytics capabilities for the future. If your EDW is close to capacity, consider investing in the next step of your data journey with Hadoop.