The Data Engineering Show Podcast | Free Listening on Podbean App

Revolutionizing Data Governance with DataStrato’s Unified Open Source Approach

Apr 8th, 2025 10:00 AM

In this episode of The Data Engineering Show, the bros sit down with Lisa Cao, Product Manager at DataStrato, to explore data catalogs and Apache Gravitino, a unified metadata lake used to manage access and perform data governance across all data sources.

What You’ll Learn:
- How Apache Gravitino differs from catalogs like Unity Catalog and Polaris by supporting multiple catalog systems.
- What the “Push-Down Permission Management” security model is and how to implement it across different data systems.
- How to maintain consistent governance across query engines like Spark, Trino, and Flink.
- Why interoperability, flexibility, and the open source ecosystem are becoming more important dynamics of data infrastructure than performance benchmarking.
- How to evaluate new data tools based on their real-world adoption rather than social media hype.

Lisa Cao is a Product Manager at DataStrato, specializing in AI/ML product partnerships and developer relations. With deep expertise in data catalog technologies and open-source ecosystems, she plays a key role in developing Apache Gravitino, an ASF incubating project that provides a unified governance and security layer for diverse data systems. Her work in developing extensible catalog frameworks has helped organizations manage complex data environments across multiple platforms.

Episode Highlights:

What is Apache Gravitino? (01:24)
Apache Gravitino is a meta-catalog that serves as a unified data governance and security layer for managing different data systems. Lisa shares that Gravitino was the first to release an Iceberg REST catalog and open sourced it for the general community to use; as time passed, Polaris and Unity Catalog were also announced as open source. She highlights that although Gravitino, Polaris, and Unity Catalog are very similar, Gravitino differs in that it can support multiple catalogs.

Unifying AI/ML and Big Data Stack (03:15)
One of the interesting things about Gravitino is that it offers more than just a data catalog: its model catalogs are a first step toward merging the two worlds of AI/ML and big data. Lisa describes the goal of effective model management: a system that can store and manage different types of models, track changes to them, and control access to them.

Simplifying Data Governance (10:49)
Think of Gravitino as a “traffic cop” that helps manage and secure data from multiple sources. It is crucial to have a system that provides unified access control across all data sources, so that access and governance are managed in one place and ML teams don't have to worry about access. Lisa says that Apache Gravitino is the system that makes data accessible to different teams and users while making sure that it is secure and governed appropriately.

Gravitino’s Query Engine Solution (21:34)
Every query engine has its own way of managing data, which makes it difficult to switch between engines: you have to reconfigure everything. Lisa highlights that Gravitino solves the problem by providing a single layer of data governance that works across multiple query engines.

Navigating the Fast-Paced World of Data Engineering (24:41)
Lisa talks about how fast the data engineering space is moving and shares some advice for keeping up:
- Don’t try to learn everything at once.
- Don't get too deep into every tool.
- Look for real-world adoption.
She warns against social media hype, which can amplify the messaging around a new tool and make it seem that everyone is using it when actual adoption is hard to verify.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Episode Resources:
- Apache Gravitino website

For Feedback & Discussions on Firebolt Core:
- Join Firebolt Discord Community
- Join Firebolt GitHub Discussions
- Firebolt Core GitHub Repository
- Benjamin@Firebolt.io

The Data Engineering Show is brought to you by firebolt.io and handcrafted by our friends over at fame.so.

Previous guests include: Joseph Machado of LinkedIn, Matthew Weingarten of Disney, Joe Reis and Matt Housley, authors of The Fundamentals of Data Engineering, Zach Wilson of Eczachly Inc, Megan Lieu of Deepnote, Erik Heintare of Bolt, Lior Solomon of Vimeo, Krishna Naidu of Canva, Mike Cohen of Substack, Jens Larsson of Ark, Gunnar Tangring of Klarna, Yoav Shmaria of Similarweb, and Xiaoxu Gao of Adyen.

Check out our three most downloaded episodes:
- Zach Wilson on What Makes a Great Data Engineer
- Joe Reis and Matt Housley on The Fundamentals of Data Engineering
- Bill Inmon, The Godfather of Data Warehousing
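
To make the "single governance layer across engines" idea from the Gravitino episode highlights more concrete, here is a minimal sketch in Python. It is not Gravitino's API: the `MetaCatalog` and `Grant` classes, the catalog names, and the `<catalog>.<schema>.<table>` naming scheme are all hypothetical, chosen only to show one policy store answering the same access question regardless of which engine asks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    """One permission: a role may perform an action on a catalog-qualified table."""
    role: str
    action: str    # e.g. "read" or "write"
    resource: str  # hypothetical naming scheme: "<catalog>.<schema>.<table>"


@dataclass
class MetaCatalog:
    """Toy central policy store sitting in front of several catalogs (illustrative only)."""
    catalogs: set[str] = field(default_factory=set)
    grants: set[Grant] = field(default_factory=set)

    def register_catalog(self, name: str) -> None:
        self.catalogs.add(name)

    def grant(self, role: str, action: str, resource: str) -> None:
        # The first dotted component names the catalog; reject unknown catalogs.
        catalog = resource.split(".", 1)[0]
        if catalog not in self.catalogs:
            raise ValueError(f"unknown catalog: {catalog}")
        self.grants.add(Grant(role, action, resource))

    def is_allowed(self, role: str, action: str, resource: str) -> bool:
        # Every engine asks the same question of the same store, so the
        # answer does not depend on which engine runs the query.
        return Grant(role, action, resource) in self.grants


meta = MetaCatalog()
for name in ("hive_prod", "iceberg_lake", "model_registry"):
    meta.register_catalog(name)

meta.grant("ml_team", "read", "iceberg_lake.sales.orders")
meta.grant("ml_team", "read", "model_registry.churn.v2")

for engine in ("spark", "trino", "flink"):
    ok = meta.is_allowed("ml_team", "read", "iceberg_lake.sales.orders")
    print(f"{engine}: ml_team read iceberg_lake.sales.orders -> {ok}")
```

The design choice illustrated is that permissions live in one place and are keyed by a catalog-qualified resource name, so switching engines does not require redefining them. A real system would also have to push those decisions down into each underlying data system, which this sketch does not attempt.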

Database Technology in the Age of AI with DuckDB Labs co-creator Hannes Mühleisen

Mar 19th, 2025 11:00 AM

In this episode of The Data Engineering Show, host Benjamin and co-host Eldad sit down with Hannes Mühleisen, CEO of DuckDB Labs and co-creator of DuckDB.

Together, they:
- Talk about the journey of DuckDB, an open-source analytical database system designed as a universal wrangling tool.
- Explain how DuckDB differs from SQLite, highlighting the analytical and transactional use cases.
- Discuss DuckDB’s special feature and its approach to innovation, including creating its own Parquet reader.
- Explore the simple and efficient ecosystem of DuckDB, which allows developers to add custom functionality without affecting the stability of its core.
- Consider Hannes' perspective on the role of AI in databases.
- Delve into the system’s infrastructure, design choices, and the team's dedication to ensuring a continuous, reliable database system.

Hannes Mühleisen is the CEO of DuckDB Labs and a Professor in The Netherlands, renowned for co-creating DuckDB, an open-source analytical database system. With a background in database architecture research at the CWI Database Architectures group, he has pioneered the development of DuckDB as a universal data wrangling tool that can run everywhere from phones to space satellites. Under his leadership, DuckDB has achieved remarkable success, reaching 10 million downloads monthly and becoming a go-to solution for analytical database needs. His commitment to keeping DuckDB lightweight, portable, and hardware-agnostic while maintaining high performance has revolutionized how developers approach analytical database solutions. As both an academic and technology leader, Hannes brings unique insights into database architecture, open-source development, and the future of analytical data processing.

Episode Highlights:

The Purpose of DuckDB (01:04)
Hannes gives a full description of what DuckDB is and what it is designed to do. He describes it as a tool that understands SQL and is specifically designed to simplify complex analytical use cases.

SQLite vs DuckDB (02:53)
Hannes compares the two tools, stating that SQLite is an amazing system meant for transactional use cases rather than analytical queries, while DuckDB is specifically designed for analytical use cases.

The Importance of Collaboration (08:14)
Hannes stresses the need for community collaboration, as the database engine space has many brilliant people trying to solve the same problems. He shares his profound admiration for a team in Munich, praising their work implementing concepts previously described only in papers.

The Component-Based Architecture of DuckDB (11:25)
Hannes highlights a special feature of DuckDB: it can be used as a component. He explains that the in-process architecture is a success because of the in-memory data sharing it enables.

The Parquet Reader Journey (17:51)
Hannes explains how he built his own Parquet reader out of necessity, although he would have preferred not to. He shares how a developer from Germany, Uwe Korn, donated an existing reader to the Arrow project, where it became so intertwined with the project that the two could no longer be used independently. Hannes adds that a competent Parquet reader has no choice but to become a database engine, which is one of the interesting things about its development.

The Role of AI in Database Interaction (22:41)
Hannes states that he doesn’t think AI has a place inside a database engine; rather, he sees it being used for optimization, noting that researchers who built their careers on optimization are out of jobs. He explains that AI should assist with tasks rather than take over execution entirely.

SQL - A Defined Interface (29:20)
Hannes introduces the relational API, a tool for building queries programmatically, stating that it simplifies the programmer's work. Although Hannes agrees that a well-defined interface is important for components like databases, he also argues that SQL can provide relatively well-defined behavior within a single system.

The Golden Age of Database (38:57)
Hannes concludes the episode by thanking Firebolt and other engineering teams for taking on core engine work. He shares his excitement about the current golden age of databases, which showcases what is possible.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Quotes:
“DuckDB is a universal data wrangling tool. It is a relational data management system that speaks SQL designed to do well on analytical use cases.”
“We call ourselves the SQLite for analytics because it explains the original design goal of DuckDB very well.”
“Within the database engine space, we are all working to solve the same problems, and that's like, a hundred of us on the planet.”
“It actually turns out in order to make a competent parquet reader, you do need query execution. There is just no way around it.”
“I really like this golden age of databases we are in and personally, as somebody who really likes tables and SQL, I'm quite happy to see things like Firebolt and others really working on core engine stuff.”

AI and Data Movement: Trends and Best Practices with Estuary’s Daniel Pálma

Feb 11th, 2025 10:16 AM

In this episode of The Data Engineering Show, the bros sit down with Daniel Pálma, Head of Marketing at Estuary.

Join them as they:
- Talk about Daniel’s career transition from data engineering to marketing and how his data engineering background has helped his marketing work.
- Discuss the role of AI in the evolution of data movement, making data pipelines faster and easier to create.
- Shine a light on the challenges of vector databases and structured data in AI applications.
- Delve into the future of Apache Iceberg and data lakehouses, highlighting their current challenges.
- Share insights on the golden age of data, expressing the need for more data engineers, data analysts, and data practitioners in the data space.

Daniel Pálma serves as Head of Marketing at Estuary, bringing a unique blend of technical expertise and marketing acumen to the data integration space. With nearly a decade of experience as a data engineer across startups, enterprises, and consulting roles, Daniel made a strategic pivot to marketing to help bridge the gap between complex technical solutions and their practical applications for data practitioners. His background in data engineering enables him to deeply understand customers' challenges and create authentic, education-focused marketing content that resonates with technical audiences. Daniel’s thought leadership and content creation in the data engineering space, combined with his hands-on technical experience, position him as a valuable voice in conversations about the evolution of data infrastructure and integration technologies.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

AI and Data Change Management with Chad Sanderson, CEO Gable AI

Jan 7th, 2025 10:00 AM

In this episode of The Data Engineering Show, host Benjamin and co-host Eldad sit down with Chad Sanderson, CEO and co-founder of Gable AI, to explore the interesting world of data change management.

Join them as they:
- Delve into the challenges of data quality, how it degrades over time, and the one-sided data quality checks on the “last mile” of the data supply chain.
- Talk about how Gable works through a three-layer flow: identify data production points, trace the data flow, and communicate the impact of changes before they reach production.
- Explain why the gap between data producers and consumers needs to be bridged and how Gable continues to emphasize effective communication and an understanding of data change management across teams.
- Shine a light on how AI can enhance data management by extracting semantics from code and effectively managing the translation output.
- Discuss Chad’s vision for 2025, which is to help companies start to care about data and how changes made to data affect other people.

Chad Sanderson is the CEO and co-founder of Gable AI, a data change management platform. Chad has over a decade of experience in the data engineering and infrastructure space, holding significant roles at major companies like Microsoft, Oracle, and Sephora, where he focused on data quality and governance challenges. He is a former Head of Data at Convoy, a LinkedIn writer, and a published author. He lives in Seattle, Washington, and is the Chief Operator of the Data Quality Camp. His journey from data scientist to data engineer and ultimately to CEO was driven by a desire to transform how organizations manage and utilize data. Gable AI addresses the complexities of the data supply chain by providing tools for code scanning, data contracts, and governance as code, enabling teams to proactively manage data changes and their impact.

If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.

Episode Resources:
- Gable AI website
- Chad Sanderson on LinkedIn
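
The three-layer flow described in the Gable episode (identify data production points, trace the data flow, communicate the impact of a change) can be illustrated with a small lineage-graph sketch. This is not Gable's implementation: the `LINEAGE` and `OWNERS` tables, the asset names, and both functions are hypothetical, and the tracing step is a plain breadth-first search.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "orders_service.Order": ["kafka.orders"],          # data production point
    "kafka.orders": ["warehouse.raw_orders"],
    "warehouse.raw_orders": ["warehouse.fct_orders"],
    "warehouse.fct_orders": ["dashboard.revenue", "ml.churn_features"],
}

# Hypothetical ownership table used to decide whom to notify.
OWNERS = {
    "warehouse.fct_orders": "data-platform",
    "dashboard.revenue": "analytics-team",
    "ml.churn_features": "ml-team",
}


def downstream(asset, lineage):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen, order, queue = {asset}, [], deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def impact_report(changed_asset, lineage, owners):
    """Map each affected owner to the downstream assets they would need to check."""
    report = {}
    for asset in downstream(changed_asset, lineage):
        owner = owners.get(asset)
        if owner is not None:
            report.setdefault(owner, []).append(asset)
    return report


# A proposed change to the producer is traced before it reaches production.
report = impact_report("orders_service.Order", LINEAGE, OWNERS)
for owner, assets in report.items():
    print(f"notify {owner}: {', '.join(assets)}")
```

Running the report before a change ships is what moves the check from the "last mile" to the point where the data is produced; in practice the lineage edges would come from code scanning rather than a hand-written dictionary.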

Tech Stacks and Tradeoffs: Xudo's Founder on Picking the Right Tools for BI Success

Nov 26th, 2024 10:15 AM

Wouter Trappers is the founder of Xudo and shares his slightly unconventional path from philosopher to data consultant with the Bros in this latest episode of The Data Engineering Show. Wouter’s grounding in philosophy has proved to be a shaping influence on his approach to business intelligence. For Wouter, BI is much more than a software solution: it is all about change management and aligning leadership with data projects.

They discuss:
- From Excel to Expert: From basic Excel tasks to full mastery of BI tools like QlikView, Wouter has blended his technical and philosophical approaches to data to become a bona fide expert.
- Data Strategy as Transformation: Good change management principles have to be followed if a BI project is going to bear fruit. Focus on leadership alignment, KPI clarity, and user empowerment instead of simply implementing software.
- Challenges of Starting Small: Wouter offers tips for smaller companies on bootstrapping their data journey using existing tools, practical education, and even Gen AI.
- Balancing Scales: Smaller startups face a very different set of challenges from large enterprises.

Wouter’s combination of philosophy and pragmatism brings fresh takes to building effective data solutions.
