Data Lakes for Modern ETL

Every business needs data analytics. If you have not revisited your architecture around Extract, Transform and Load (ETL), you are falling behind. Even the acronym for this type of data processing has changed: the process is now commonly called ELT, for Extract, Load and Transform, because the data is loaded first and transformed afterward. Companies today should start transitioning their data analytics architectures to include Data Lakes, which store both structured and unstructured business data.

If you are not using a Data Lake today, you most likely have an Enterprise Data Warehouse. If you run one, I probably do not have to describe its challenges. You know the issues: you are exceeding your Service Level Agreements (SLAs) for reporting, you are struggling to consume new data sources, and your team is stressed out trying to keep ETL processes in step with business demand. It does not have to be this way, but you will need to adjust your data analytics architecture. Here is a depiction of a legacy, or traditional, data analytics architecture showing the Extract, Transform and Load process:

Legacy ETL Processing

If this looks familiar, so, unfortunately, will the side effects mentioned earlier. What is the solution, and how long would it take to implement? The solution is to create a Data Lake to store your operational data for impact analytics. Notice that I used the word Lake and not Swamp. It is not enough to create a Data Lake and dump data into it. The Lake has to be managed, or it will eventually become a data swamp. With a managed Data Lake you can ingest new data sources faster than before and gain insights and discoverability beyond what is available today, all while maintaining active security controls, with options for more granular security based on roles in given business groups.

I would like to share what I believe to be the most modern approach to ETL and Data Lake architectures available right now. This is Data Lake Formation from Amazon Web Services (the service's official name is AWS Lake Formation):

Amazon Data Lake Formation

Data Lake Formation – What business value does this architecture provide?

Data Lake Formation is a service that makes it easy to set up and secure a data lake. It was created to collect, store, analyze and share data at any scale. Because Data Lake Formation is a cloud service offering, you also get the business value of cloud services.
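To give a feel for what "securing" a data lake can look like in practice, here is a minimal Python sketch of a column-level read grant. It assumes the boto3 SDK and its Lake Formation `grant_permissions` call. The role ARN, database, table and column names are hypothetical placeholders, and the request builder is kept separate from the call so it can be inspected without AWS credentials.

```python
def build_grant_request(principal_arn, database, table, columns):
    """Build the arguments for a column-level SELECT grant.

    All names passed in are illustrative; nothing here contacts AWS.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        # Read-only access to the listed columns.
        "Permissions": ["SELECT"],
    }


def grant_read_access(lakeformation_client, principal_arn, database, table, columns):
    """Send the grant using a client such as boto3.client("lakeformation")."""
    request = build_grant_request(principal_arn, database, table, columns)
    return lakeformation_client.grant_permissions(**request)


# Usage (needs boto3 and AWS credentials; every name below is hypothetical):
#   import boto3
#   grant_read_access(
#       boto3.client("lakeformation"),
#       "arn:aws:iam::123456789012:role/FinanceAnalyst",
#       "sales_lake", "orders", ["region", "amount"],
#   )
```

A grant like this is one way a business group could be limited to the columns it actually needs.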

More focus on new business capabilities

Data Lake Formation automates the orchestration of data lake creation from beginning to end, covering the complete process from data ingestion to transformed data ready for business analysis. Starting with registering an Amazon S3 (Simple Storage Service) bucket, or one of several other ingest options, you can build an automated AWS Glue workflow. The AWS Glue workflow loads and transforms the data and stores it in Parquet, a columnar file format built for efficient storage and query performance. Automating these steps means you spend less time learning an AWS product and more time on your specific business and related product analytics.
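To make the first step more concrete, here is a minimal Python sketch of registering an S3 location and starting a Glue workflow. It assumes the boto3 SDK (`register_resource` on the Lake Formation client and `start_workflow_run` on the Glue client). The bucket name `example-sales-raw` and the workflow name `sales-ingest` are hypothetical, and the workflow itself is assumed to exist already.

```python
def s3_bucket_arn(bucket_name):
    """Return the ARN of an S3 bucket (standard 'aws' partition assumed)."""
    return f"arn:aws:s3:::{bucket_name}"


def build_register_request(bucket_name):
    """Build the arguments for registering the bucket with the data lake."""
    return {
        "ResourceArn": s3_bucket_arn(bucket_name),
        # Assumption: the service-linked role is used to reach the bucket.
        "UseServiceLinkedRole": True,
    }


def register_and_start(bucket_name, workflow_name, lakeformation_client, glue_client):
    """Register the bucket, start the Glue workflow, and return the run id."""
    lakeformation_client.register_resource(**build_register_request(bucket_name))
    response = glue_client.start_workflow_run(Name=workflow_name)
    return response["RunId"]


# Usage (needs boto3, AWS credentials, and an existing Glue workflow):
#   import boto3
#   run_id = register_and_start(
#       "example-sales-raw", "sales-ingest",
#       boto3.client("lakeformation"), boto3.client("glue"),
#   )
```

The clients are passed in rather than created inside the function, which keeps the sketch easy to try against a stub before pointing it at a real account.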

Reduced engineering costs

Legacy ETL processes are expensive to maintain. Typically, the entire process of ingestion, transformation and loading has to be built for every data source and repeated each time a new source is added. From then on, maintenance becomes expensive because any change in an external data model requires rework in each extraction process. Most, if not all, legacy ETL processes store data in tables with a very well-defined structure, so any data model change also has to be reflected in the storage layer. With Data Lake Formation you can store both structured and unstructured data, bring in new unstructured data sources and perform analytics quickly. Going from ingest to exposing data for consumption with standard SQL (Structured Query Language) through tools like Amazon Athena is possible in minutes. Data Lake Formation lets your engineers get more done in less time, which saves on engineering costs.
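As an illustration of the "standard SQL through Athena" step, here is a minimal Python sketch that submits a query. It assumes the boto3 SDK and its Athena `start_query_execution` call. The `orders` table, the `sales_lake` database and the results location `s3://example-athena-results/` are hypothetical placeholders, not names from any real environment.

```python
# Hypothetical query against a hypothetical table in the data lake.
EXAMPLE_SQL = """
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
ORDER BY total_sales DESC
"""


def build_athena_query_request(sql, database, output_s3_uri):
    """Build the arguments for submitting a query to Athena."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def submit_query(athena_client, sql, database, output_s3_uri):
    """Submit the query and return its execution id (results arrive later)."""
    request = build_athena_query_request(sql, database, output_s3_uri)
    response = athena_client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   query_id = submit_query(
#       boto3.client("athena"), EXAMPLE_SQL,
#       "sales_lake", "s3://example-athena-results/",
#   )
```

The query runs asynchronously, so a real script would poll for completion before reading results; that part is left out to keep the sketch short.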

Modern architecture with flexibility

Everyone knows that getting all the details right the first time is very difficult. Data Lake Formation can help with that: you can make a mistake and recover without huge overhead. With automation and cloud infrastructure you natively have flexibility in a way never before found with legacy ETL processes. AWS has many built-in connectors and services tailored to the needs of modern ELT processes. When you have missed the mark, you can quickly select other options to get the most out of your modern ELT process. With Data Lake Formation you get all of the traditional AWS flexibility while running a modern ELT process.

Ability to scale up to meet demand

What happens when you cannot finish your analytics batch processes within the batch window? What are the options? Violate the Service Level Agreement? Create another environment to process in parallel? All of these options can be very expensive in both dollars and time, with time most likely being the greatest obstacle. When using Data Lake Formation you can scale out compute, storage or both at will. There are also many options for scaling your workload when using Amazon Redshift.
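The batch window problem comes down to simple arithmetic, sketched below in Python. The function is a back-of-the-envelope estimate under one strong assumption: the workload splits evenly across workers with no coordination overhead. Real jobs rarely scale that cleanly, so treat the result as a lower bound. The 20-hour job and 6-hour window are made-up numbers.

```python
import math


def workers_needed(single_worker_hours, window_hours):
    """Estimate the fewest parallel workers that fit a job into a window.

    Assumes perfectly linear scaling, which is an idealization.
    """
    if single_worker_hours <= 0 or window_hours <= 0:
        raise ValueError("durations must be positive")
    return max(1, math.ceil(single_worker_hours / window_hours))


# A job that takes 20 hours on one worker, with a 6-hour batch window:
print(workers_needed(20, 6))  # prints 4
```

With elastic compute, going from one worker to four is a configuration change rather than a hardware purchase, which is the point of scaling out on demand.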

How long would it take my business to implement this architecture?

This is an interesting question because so many factors go into the answer. However, there is no doubt you will save weeks, if not months, getting your modern ELT process running when using Data Lake Formation. There really is no fair comparison to the automation that Data Lake Formation provides. It would be hard to implement a solution that provides durable storage with replication, simplified ingestion and cleaning, fine-grained permissions and an integration toolset more quickly than can be done with Data Lake Formation. In general, most would agree you can get a modern ELT process running with AWS Data Lake Formation within days.