How To Create a Data Lake?
What Do We Mean By a Data Lake?
In this article, I will talk about a commonly used term: the data lake. So what do we mean by a data lake? Obviously, we are not talking about an actual lake here. I hope I can shed some light on the topic.
A data lake is a place where you can store both structured and unstructured data. You can store the data in its raw format, without needing to structure it first, and then run analytics on it. You can store as much as you want, including data you don’t need today but might need in the future. However, being able to store as much as you want does not mean you should put all the data you have here.
Why Are Data Lakes Useful?
In today’s competitive business environment, many successful enterprises are data driven, and many of them keep their data in data lakes. This can help them outperform peers who are not using data lakes. Implementing a data lake allows companies to store their data indefinitely in a cost-efficient way. Storing the data lets companies run various analyses on it, giving the data team the freedom to explore new things, ask new questions and discover potential opportunities. As technology advanced, companies started to run advanced analytics on the stored data, and many began to develop machine learning models over the various sources stored in the data lake.
Implementing a data lake gives data teams a fantastic opportunity to dig deep into the stored data, discover potential business opportunities and act on them. However, a well-organised data lake makes this much easier than a disorganised one. Imagine you are looking for something in a very messy bedroom (like mine): it is much harder to find what you are looking for than in a tidy bedroom.
Data Lake vs Data Warehouse
Data lakes and data warehouses serve different purposes. Every organisation is different, so one might need a data warehouse while another needs a data lake. What the two have in common is that they both store data.
A data lake is a place where you can store all kinds of data, including unstructured data. You can store the data without designing a schema for it, and you don’t need to know which question the data will answer. You simply store it so that it can be used in the future if needed. A data warehouse is like the tidier cousin of the data lake: the data stored there is structured and is stored for a specific purpose. The structure and schema of the data are set in advance, which helps SQL queries run faster.
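To make the difference concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the sample events, the `read_purchases` helper and the `purchases` table are all hypothetical, and an in-memory SQLite database stands in for a real warehouse. The lake keeps every record exactly as it arrived and applies structure only when a question is asked, while the warehouse fixes its schema before any data is loaded.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from an application.
raw_events = [
    '{"user": "ada", "event": "signup", "plan": "free"}',
    '{"user": "bob", "event": "purchase", "amount": 19.9}',
    '{"user": "eve", "event": "pageview", "url": "/pricing"}',
]

# Data lake style: keep every record exactly as it arrived, no schema needed.
lake = list(raw_events)


def read_purchases(lake_records):
    """Apply structure only at read time, for one specific question."""
    purchases = []
    for line in lake_records:
        record = json.loads(line)
        if record.get("event") == "purchase":
            purchases.append((record["user"], float(record["amount"])))
    return purchases


# Data warehouse style: the schema is set in advance, before loading data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (user_name TEXT NOT NULL, amount REAL NOT NULL)"
)
warehouse.executemany(
    "INSERT INTO purchases VALUES (?, ?)", read_purchases(lake)
)

# A plain SQL query now works because the structure is already known.
total = warehouse.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
```

Note that the signup and pageview events never reach the warehouse table, but they are still sitting in the lake if a future question needs them.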
How To Implement a Data Lake?
Cleaning and structuring your data lake can actually prove harmful. Unlike a data warehouse, where data is cleaned, enriched and structured, a data lake does not benefit from these actions, because cleaning it may remove valuable data that you would have used in the future. Let me put it simply: you walk into a very messy room where everything has been dumped, throw away the things that seem to have no value to you, and later discover that you have thrown away very valuable items.
To avoid throwing away valuable data, there are a few things to do when you want to implement a data lake successfully.
1) Choosing Your Database Architecture
It is important to choose the most suitable big data platform for your use case. You need to decide which kind of architecture works best for your business. The aim is to have a clear understanding of your needs and your analytics efforts.
If you want to learn more about how to choose your database architecture, read our article here.
2) Don’t Stash Everything Into the Data Lake
Just because you can store as much data as you want does not mean you should dump all of it into your data lake. Throwing everything in regardless of its structure and organisation makes the data lake harder to navigate, and the messier it becomes, the higher the chance that valuable data goes to waste.
3) Have Privacy and Security Protocols
It is important to secure your data lake. Make sure you define who can access it and who can make changes to the data. This gives you control over your data and prevents unnecessary changes that can lead to confusion in your data lake.
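As a rough illustration of defining who can access and change the data, here is a small role-based check in Python. The role names, the users and the `write_object` helper are hypothetical, and a plain dictionary stands in for the storage layer. In practice you would normally rely on the access controls of your storage platform instead of writing your own.

```python
# Hypothetical roles and the actions each role may perform.
PERMISSIONS = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
    "admin": {"read", "write", "delete"},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {"ada": "admin", "bob": "analyst"}


def is_allowed(user, action):
    """Return True only if the user's role grants the requested action."""
    role = USER_ROLES.get(user)
    return action in PERMISSIONS.get(role, set())


def write_object(user, store, key, data):
    """Write to the lake only after the permission check passes."""
    if not is_allowed(user, "write"):
        raise PermissionError(f"{user} may not write to the data lake")
    store[key] = data


store = {}  # stand-in for the real storage layer
write_object("ada", store, "events/2024-01-01.json", b"{}")
```

Unknown users get no permissions at all, so anyone not listed is denied by default.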
4) Record Changes
Keeping track of changes is really important. People tend to dump everything into the data lake, so you need to understand the relationships between your datasets. To do that, keep an activity log so that you can see who did what in your data lake. An activity log is especially useful if quite a few users have access to your data lake. Even though data lakes are flexible and unstructured, it is really important to have a logical structure defined by clear goals.
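An activity log can be very simple. The sketch below is a hypothetical example in Python: the field names and the `record_change` and `history_for` helpers are my own, and a list of JSON lines stands in for wherever you actually keep the log. Each entry records who did what to which dataset, and when.

```python
import json
from datetime import datetime, timezone


def record_change(log, user, action, dataset, when=None):
    """Append one entry describing who did what, to which dataset, and when."""
    entry = {
        "timestamp": (when or datetime.now(timezone.utc)).isoformat(),
        "user": user,
        "action": action,
        "dataset": dataset,
    }
    log.append(json.dumps(entry))  # entries are only ever appended
    return entry


def history_for(log, dataset):
    """Return every logged change that touched the given dataset."""
    return [e for e in map(json.loads, log) if e["dataset"] == dataset]


activity_log = []
record_change(activity_log, "ada", "upload", "events/raw")
record_change(activity_log, "bob", "delete", "events/old")
record_change(activity_log, "bob", "upload", "events/raw")

raw_history = history_for(activity_log, "events/raw")
```

Here `raw_history` holds the two uploads to `events/raw`, in the order they happened.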
5) Create Documentation
You need good documentation for your unstructured data. It should be like a handbook for your employees, allowing your team to navigate the data lake more efficiently. At Rakam, we created a taxonomy feature to prevent confusion.
One possible problem is that you usually end up with hundreds of event types as you implement new features and remove others from your application. Some event types become obsolete, some are crucial for your product managers and some are not easily understood at first glance. That’s why people prepare documents for event type definitions, but since these documents are not integrated with your BI tools, they are often inefficient and out of date. We created the taxonomy feature to address this problem. Now you can hide your event tables, add labels, write descriptions and categorise them, so that every team member sees the event types in a structured way and easily understands what they stand for.
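To give a feel for the idea, here is a generic sketch in Python of what such a catalogue of event types could look like. This is not Rakam’s actual implementation or API: the `EventType` class, its fields and the sample events are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class EventType:
    """Hypothetical catalogue entry describing one event type."""
    name: str
    label: str = ""
    description: str = ""
    category: str = "uncategorised"
    hidden: bool = False


def visible_by_category(event_types):
    """Group the event types that are not hidden by their category."""
    grouped = {}
    for event_type in event_types:
        if event_type.hidden:
            continue  # obsolete or internal events stay out of sight
        grouped.setdefault(event_type.category, []).append(event_type)
    return grouped


catalogue = [
    EventType("signup", "Sign up", "User created an account", "onboarding"),
    EventType("purchase", "Purchase", "User completed a payment", "revenue"),
    EventType("legacy_click", hidden=True),
]

grouped = visible_by_category(catalogue)
```

Hidden event types are kept in the catalogue, so nothing is lost, but team members only see the ones that are labelled, described and categorised.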