Executive Guide to Data Lakes: Best Practices

Data lakes have the potential to revolutionize your organization’s data storage and analytics capabilities. However, to harness their full potential, it’s essential to avoid common pitfalls and embrace best practices. Based on our extensive experience building data lake architectures, we have identified several key insights and common mistakes. Here are the most critical ones to consider:

1. Develop Plausible Use Cases

While the allure of data lakes is strong, it’s crucial to have clear use cases that address specific business needs before implementation. Thorough use case analysis sets clear expectations for stakeholders and significantly boosts the chances of successful deployment and goal achievement.


2. Avoid Exporting Relational Databases to Files in the Data Lake

Ingesting normalized database tables, such as those in a snowflake schema, into the data lake can cause performance problems and high latency. Big data technologies are designed for large, flat, denormalized files, not for many small files that require foreign-key joins. Denormalize during extraction and land larger, flat files in the lake for better performance.
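For illustration, here is a minimal PySpark sketch of this pattern; the table names, paths, and join key are hypothetical placeholders:

    # Denormalize at extraction time so the lake receives one wide file.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("denormalize-extract").getOrCreate()

    # Normalized extracts staged from the source database (hypothetical paths).
    orders = spark.read.parquet("/staging/orders")
    customers = spark.read.parquet("/staging/customers")

    # Resolve the foreign key once, at write time, so downstream jobs scan
    # a single flat dataset instead of re-joining small files on every query.
    flat = orders.join(customers, on="customer_id", how="left")

    flat.write.mode("overwrite").parquet("/lake/orders_flat")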


3. Prevent the Creation of Data Pond Silos or Data Swamps

Smaller repositories often emerge when different groups ingest different data or apply their own schemas and processing conventions. This fragmentation leads to data silos and, eventually, data swamps, making it difficult to maintain data consistency, quality, and coherence. Avoid these pitfalls by planning for the big picture, agreeing on schemas, and maintaining detailed metadata. Train staff across departments to use and maintain the unified data structure.
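To make “agreeing on schemas” concrete, here is a minimal Python sketch (using pyarrow) that checks an incoming file against a shared schema contract; the field names and path are hypothetical:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # The schema every department has agreed to publish against.
    AGREED_SCHEMA = pa.schema([
        pa.field("event_id", pa.int64()),
        pa.field("event_time", pa.timestamp("ms")),
        pa.field("source_system", pa.string()),
    ])

    def conforms(path: str) -> bool:
        """Return True only if the file matches the agreed contract."""
        return pq.read_schema(path).equals(AGREED_SCHEMA)

    if not conforms("/incoming/events.parquet"):
        raise ValueError("Schema drift detected; refusing to ingest")

A gate like this at ingestion time keeps one team’s schema drift from quietly turning the shared lake into a swamp.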


4. Automate Installation and Configuration

Manual configuration management may seem quicker initially but doesn’t pay off in the long run. Use IT automation tools like Docker Machine, Rancher, or Chef to streamline maintenance and support. Embracing microservices architecture can further improve efficiency by allowing your data lake to be developed as a suite of small, independently deployable services.
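As one illustration, here is a short sketch using the Docker SDK for Python (the docker package); the image, container name, and environment variable are hypothetical placeholders:

    import docker

    client = docker.from_env()

    # Scripted, repeatable deployment: the same reviewed steps run
    # identically on every node, unlike ad-hoc manual configuration.
    client.images.pull("myorg/ingest-service:1.4.2")
    client.containers.run(
        "myorg/ingest-service:1.4.2",
        name="ingest-service",
        detach=True,
        restart_policy={"Name": "on-failure", "MaximumRetryCount": 3},
        environment={"LAKE_ROOT": "/lake"},
    )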


5. Do Not Use RAID for Data Nodes

Hadoop and similar big data technologies already replicate data blocks across multiple nodes for fault tolerance and spread I/O across disks for performance. Adding RAID on DataNodes duplicates that redundancy and can drag throughput down to the slowest disk in the array. Instead, use a JBOD (Just a Bunch of Disks) layout for DataNodes and reserve RAID for NameNodes, which are single points of failure.
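In Hadoop terms, this usually comes down to two hdfs-site.xml properties; the mount points below are hypothetical examples:

    <!-- JBOD: one entry per physical disk; HDFS spreads blocks across them. -->
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
    </property>

    <!-- NameNode metadata is the single point of failure;
         keep it on RAID-backed storage. -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>/raid/dfs/nn</value>
    </property>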


6. Avoid Logical Volume Management for Data Nodes

Logical volume management can degrade performance by adding an extra layer between the filesystem and the physical disks. This is particularly problematic for DataNodes, where sequential I/O throughput is critical. Stick to simple, direct disk mounts and avoid unnecessary complexity.
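For example, each DataNode disk can be mounted directly in /etc/fstab with no volume manager in between; the device names and mount points are hypothetical, and noatime is a common HDFS tuning choice:

    /dev/sdb1  /data/1  ext4  defaults,noatime  0 0
    /dev/sdc1  /data/2  ext4  defaults,noatime  0 0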


7. Steer Clear of SAN/NAS for Data Nodes

Using SAN or NAS storage with Hadoop contradicts the best practice of moving computation to the data. SAN/NAS setups create single points of contention and increase network hops, leading to performance bottlenecks. Instead, use local disks on commodity hardware to keep computation close to the data.
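One way to sanity-check locality is to ask HDFS where a dataset’s blocks actually live; the path here is a hypothetical example:

    hdfs fsck /lake/orders_flat -files -blocks -locations

The report lists which DataNodes hold each block replica, which is useful for confirming that data is distributed across the cluster’s local disks rather than funneled through a single storage endpoint.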


8. Optimize Infrastructure with Containers and Bare Metal

While virtual machines and containers both have their uses, DataNodes should run on bare metal for optimal disk and network performance. This maximizes throughput where it matters most while still allowing containers and DevOps tooling for other parts of the infrastructure. Many cloud vendors now offer bare-metal instances, which can provide the best of both worlds.


Conclusion

The debate between data lakes and enterprise data warehouses often misses the point that both have their unique strengths. Data lakes offer new and incremental value, particularly when starting without an existing central repository for data intelligence. However, they may lack some of the mature features of traditional data warehouses. A polyglot approach, integrating both data lakes and data warehouses, can often provide the best solution, leveraging the strengths of each.

Data lakes are proving to be invaluable tools, in many scenarios surpassing traditional, manually curated data warehouses. They offer greater storage and processing capacity at lower cost, adapting to the ever-changing data intelligence and analysis needs of users. Embrace the future of data storage and management by implementing these best practices in your data lake strategy.

This post concludes our series on data lakes.
