In their own words, "ZS is a professional services firm that works closely with companies to help develop and deliver products and solutions that drive customer value and company results. ZS engagements involve a blend of technology, consulting, analytics, and operations, and are targeted toward improving the commercial experience for clients."

ZS set up and operates a MicroStrategy-based BI application that sources 700 GB of data from Amazon Redshift, used as a data warehouse, in an AWS-hosted backend architecture. ZS sourced healthcare data from various pharma data vendors out of different systems, such as Amazon S3 buckets and FTP systems, into the data lake. They processed this data using transient Amazon EMR clusters and stored it on Amazon S3 for reporting consumption. The reporting-specific data is moved to Amazon Redshift using COPY commands, and MicroStrategy uses it to refresh front-end dashboards. ZS has strict, client-set SLAs to meet with the available Amazon Redshift infrastructure, so we carried out experiments to identify an approach that handles large data volumes on the available small Amazon Redshift cluster.

This post describes an approach for loading a large volume of data from Amazon S3 to Amazon Redshift and applying efficient distribution techniques for enhanced performance of reporting queries on a relatively small Amazon Redshift cluster.

ZS infrastructure is hosted on AWS, where ZS stores and processes pharma industry data from various vendors using AWS services before reporting the data on the MicroStrategy BI tool. The following diagram shows the overall data flow from flat files to reports shown on MicroStrategy for end users.

Step 1: Pharma data is sourced from various vendors and different systems, such as FTP locations, individual systems, and Amazon S3 buckets.
Step 2: Cost-effective transient Amazon EMR clusters are spun up as needed to provide the compute power to run PySpark code.
Step 3: After processing, the data is stored in Amazon S3 buckets for consumption by downstream applications.
Step 4: 700 GB of data is then ingested into Amazon Redshift for MicroStrategy (MSTR) consumption.
Step 5: The data is read from Amazon Redshift, and the insights are displayed to end users as reports on MicroStrategy.

In this specific scenario, ZS was working with data from the pharma domain. The following table demonstrates the data's typical structure: it has several doctor-, patient-, and treatment-pertinent IDs, plus healthcare metrics.

Each table has approximately 35–40 columns and holds approximately 200–250 million rows of data. ZS used 40 such tables; they sourced the data in these tables from various healthcare data vendors and processed it as per reporting needs. The total dataset is approximately 2 TB in CSV format and approximately 700 GB in Parquet format.

The five-step process for data refresh and insight generation outlined previously takes place over the weekends within a stipulated time frame. In the default, unoptimized state, the data load from Amazon S3 to Amazon Redshift and the MicroStrategy refresh (Step 4 in the previous diagram) took almost 13–14 hours on a 2-node ds2.8xlarge cluster and was affecting the overall weekend run SLA (1.5 hours).

The following diagram outlines the three constraints that ZS had to solve for to meet client needs:

Weekly time-based SLA – Load within 1 hour and fetch data on MSTR within 1.5 hours. In this scenario, the client team had moved from another vendor to AWS, and the overall client expectation was to reduce costs without a significant performance dip. The client IT and business teams set a strict SLA to load 700 GB of Parquet data (equivalent to 2 TB in CSV format) onto Amazon Redshift and refresh the reports on the MicroStrategy BI tool.

Fixed cluster size – A pre-decided 2-node ds2.8xlarge cluster. The client IT teams determined the cluster size and configuration, taking into consideration cost, data volumes, and load patterns. These were fixed and not adjustable.

High data volume – Truncate and load 700 GB of data in Parquet format. With every refresh, even historic data was updated, so a large share of the data could not simply be appended; therefore, we followed a truncate-and-load process.

ZS carried out PoCs to optimize the environment subject to these constraints.
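The truncate-and-load refresh pattern described above can be sketched as a pair of SQL statements issued against Amazon Redshift: a TRUNCATE followed by a COPY that ingests Parquet directly from Amazon S3. The snippet below is a minimal illustration of that pattern; the table name, S3 path, and IAM role ARN are hypothetical placeholders, not values from the ZS engagement.

```python
# Minimal sketch of a truncate-and-load refresh cycle for Amazon Redshift.
# All identifiers below (table, bucket, role) are hypothetical examples.

def build_truncate_and_load(table: str, s3_path: str, iam_role: str) -> list:
    """Return the SQL statements for one truncate-and-load refresh cycle."""
    truncate_sql = f"TRUNCATE TABLE {table};"
    # Redshift's COPY command can read Parquet directly from S3
    # (FORMAT AS PARQUET); columnar Parquet is why 700 GB suffices
    # where the same data occupies roughly 2 TB as CSV.
    copy_sql = (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS PARQUET;"
    )
    return [truncate_sql, copy_sql]


statements = build_truncate_and_load(
    table="reporting.rx_claims",                              # hypothetical table
    s3_path="s3://example-bucket/processed/rx_claims/",       # hypothetical path
    iam_role="arn:aws:iam::123456789012:role/redshift-copy",  # hypothetical role
)
for sql in statements:
    print(sql)
```

In practice these statements would run inside one transaction via a Redshift client (for example, psycopg2 or the Redshift Data API), so report queries never observe an empty table mid-refresh.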