Open-Source Data Catalog Amundsen with Mark Grover @ Stemma
In this episode of Building The Backend we hear from Mark Grover, founder @ Stemma and co-creator of Amundsen. Stemma is a fully managed data catalog, powered by the leading open-source data catalog, Amundsen. Below are the top 3 value bombs: Automated data catalogs are critical to help wrangle the growing data across organizations (e.g. being able to identify that out of 150 columns on a table, only 10 are being used downstream). Tribal knowledge and context cannot be automated, so data catalogs cannot be 100% automated. Amundsen is an open-source data catalog originally created at Lyft, and Stemma has created a managed version of it. Help me improve the podcast by completing this 60 second survey: https://buildingthebackend.com/survey
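To make the "10 of 150 columns" idea concrete, here is a minimal Python sketch of the kind of check an automated catalog could run. It is not Amundsen's or Stemma's API: the function name `used_columns`, the column names, and the hand-written list of downstream references are all hypothetical stand-ins for lineage metadata a real catalog would collect.

```python
from typing import Iterable, Set, Tuple


def used_columns(
    table_columns: Iterable[str],
    downstream_refs: Iterable[Iterable[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (used, unused) columns given the columns each downstream job reads."""
    all_cols = set(table_columns)
    referenced: Set[str] = set()
    for refs in downstream_refs:
        referenced.update(refs)
    used = all_cols & referenced  # ignore references to columns not on this table
    return used, all_cols - used


# Hypothetical table with 150 columns.
columns = [f"col_{i}" for i in range(150)]

# Hypothetical lineage: the columns read by three downstream jobs.
downstream = [
    ["col_0", "col_1", "col_2"],
    ["col_2", "col_3", "col_4", "col_5"],
    ["col_6", "col_7", "col_8", "col_9", "col_0"],
]

used, unused = used_columns(columns, downstream)
print(f"{len(used)} of {len(columns)} columns are used downstream")
# prints "10 of 150 columns are used downstream"
```

The part this sketch cannot cover is the second value bomb: why a column exists or what it means still has to come from people.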
Architecting a Modern Data Lake with Dipti Borkar from Ahana
In this episode of Building The Backend we hear from Dipti Borkar, cofounder @ Ahana, a managed service for Presto on AWS. We talk all about the data lake, how it should be structured and where the industry is going. Below are the top 3 value bombs: Presto is an open source distributed SQL query engine originally created by Facebook, mainly used to run SQL queries on data lakes, but it can be connected to relational data stores as well. Ahana is a managed Presto service on AWS with 3x price/performance. When optimizing your data lake, it’s normally best to store the data in Parquet or ORC rather than JSON or CSV, because Parquet and ORC are columnar formats that can have indexes built in. Data lakehouses are continuing to gain popularity by bringing the benefits of your data lake and data warehouse together with the help of tools like Databricks Delta Lake and Apache Hudi.
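The advice about columnar formats can be illustrated with a toy in-memory model. This is only a sketch of the idea, not the Parquet or ORC file layout: the sample rows, the `to_columnar` helper and the chunk size are all assumptions made for illustration. It stores each column contiguously per chunk, keeps min/max statistics alongside, and uses those statistics to skip chunks that cannot match a filter.

```python
# Hypothetical row-oriented records, as they might arrive from JSON or CSV.
rows = [
    {"user_id": 1, "country": "US", "amount": 10.0},
    {"user_id": 2, "country": "DE", "amount": 25.5},
    {"user_id": 3, "country": "US", "amount": 7.25},
    {"user_id": 4, "country": "FR", "amount": 40.0},
]


def to_columnar(records, chunk_size):
    """Split records into chunks; store each column as its own list with min/max stats."""
    chunks = []
    for start in range(0, len(records), chunk_size):
        batch = records[start:start + chunk_size]
        columns = {name: [r[name] for r in batch] for name in batch[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        chunks.append({"columns": columns, "stats": stats})
    return chunks


def sum_amount_where_user_id_at_least(chunks, threshold):
    """Sum `amount` for rows with user_id >= threshold, skipping chunks via stats."""
    total = 0.0
    chunks_read = 0
    for chunk in chunks:
        _, max_id = chunk["stats"]["user_id"]
        if max_id < threshold:
            continue  # no row in this chunk can match, so its data is never read
        chunks_read += 1
        # Only two of the three columns are touched; `country` is never read.
        ids = chunk["columns"]["user_id"]
        amounts = chunk["columns"]["amount"]
        total += sum(a for i, a in zip(ids, amounts) if i >= threshold)
    return total, chunks_read


chunks = to_columnar(rows, chunk_size=2)
total, chunks_read = sum_amount_where_user_id_at_least(chunks, 3)
print(total, chunks_read)  # prints "47.25 1"
```

With row-oriented JSON or CSV, the same query would have to parse every field of every record, which is the cost the columnar layout avoids.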
Open Source BI with Apache Superset
What tools are you using for data viz? Are they low cost? One option is Apache Superset. In this episode we speak with Robert Stolz to learn more about Superset and other open source data tools. Top 3 value bombs: One popular use case for Apache Superset is embedding it within applications; because it’s open source, there is a lot of flexibility to integrate it with existing systems. Apache Superset supports any data source supported by SQLAlchemy, the Python SQL toolkit. dbt encourages a set of best practices around data development (e.g. source control and test driven development).
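In practice, SQLAlchemy support means a database is described by a connection URI of the form `dialect+driver://username:password@host:port/database`. The sketch below only assembles such a string with the Python standard library; the helper name `sqlalchemy_uri`, the host, and the credentials are hypothetical, and whether a given dialect works also depends on its driver being installed.

```python
from urllib.parse import quote_plus


def sqlalchemy_uri(dialect, user, password, host, port, database, driver=None):
    """Build a SQLAlchemy-style connection URI, URL-encoding the credentials."""
    scheme = f"{dialect}+{driver}" if driver else dialect
    return (
        f"{scheme}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical PostgreSQL warehouse; special characters in the password are escaped.
uri = sqlalchemy_uri(
    "postgresql", "analyst", "p@ss/word", "db.example.com", 5432, "warehouse",
    driver="psycopg2",
)
print(uri)
# prints "postgresql+psycopg2://analyst:p%40ss%2Fword@db.example.com:5432/warehouse"
```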
Edge Computing and Continuous Intelligence with Swim
In this episode of Building The Backend we hear from Simon Crosby, CTO @ Swim, an open source edge computing operating system. We talk all about edge computing, event streaming and much more. Below are the top 3 value bombs: Edge means more than just being physically located somewhere; it could also mean in the cloud. It really is the closest point to where your source data is being generated. Continuous intelligence is a design pattern where streaming data is directly tied into business operations. Kafka is continuing to hold its strong position in the event streaming space.
12 Modern Data Architecture Principles That Should Be Implemented in 2022
This episode is a little different than the usual format. Instead of interviewing a data leader, I share what I consider to be the 12 most important principles when designing a modern data architecture. Please message me on LinkedIn with your thoughts on this show.