Is it possible to build a real-time data platform without using stateful stream processing? Forecasty.ai is an artificial intelligence platform for forecasting commodity prices, imparting insights into the future valuations of raw materials for users. Nearly all AI models are batch-trained once, but precious commodities are linked to ever-fluctuating global financial markets, which require real-time insights. In this episode, Ralph Debusmann (CTO, Forecasty.ai) shares their journey of migrating from a batch machine learning platform to a real-time event streaming system with Apache Kafka® and delves into their approach to making the transition frictionless.
Ralph explains that Forecasty.ai was initially built on batch processing; however, updating the models with batch data syncs was costly and environmentally taxing. There was also the question of scalability: growing from the 60 commodities on offer today to their eventual goal of over 200. Ralph observed that most streaming-based real-time data platforms rely on stateful stream processing with Kafka Streams, Apache Flink®, or even Apache Samza. However, stateful stream processing demands significant resources, such as a dedicated team of stream processing specialists.
With the existing team, Ralph decided to build a real-time data platform without any stateful stream processing. They stick strictly to out-of-the-box components: Kafka topics, the Kafka Producer and Consumer APIs, and Kafka connectors, combined with a real-time database that ingests the data streams and performs the necessary joins inside the database itself.
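The pattern Ralph describes can be sketched roughly as follows. This is a minimal illustration, not Forecasty.ai's actual code: SQLite stands in for the real-time database, plain Python lists stand in for records consumed from Kafka topics, and all table, column, and stream names are hypothetical. The point is that the consumers are stateless sinks, and the join happens in SQL rather than in a stateful stream processor.

```python
import sqlite3

def ingest(conn, table, records, columns):
    """Stateless sink: write each consumed record straight into the database."""
    placeholders = ",".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} ({','.join(columns)}) VALUES ({placeholders})",
        records,
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (commodity TEXT, price REAL)")
conn.execute("CREATE TABLE forecasts (commodity TEXT, predicted REAL)")

# Simulated messages from two hypothetical Kafka topics; in the real
# platform these would come from the Kafka Consumer API or a connector.
price_stream = [("copper", 9100.0), ("nickel", 21000.0)]
forecast_stream = [("copper", 9350.0), ("nickel", 20500.0)]

ingest(conn, "prices", price_stream, ("commodity", "price"))
ingest(conn, "forecasts", forecast_stream, ("commodity", "predicted"))

# The join lives in the database, not in Kafka Streams/Flink/Samza.
rows = conn.execute(
    """SELECT p.commodity, p.price, f.predicted
       FROM prices p JOIN forecasts f USING (commodity)"""
).fetchall()
print(rows)
```

The design trade-off is that the database, not the streaming layer, owns all state, so the team never has to operate or debug a stateful stream processing topology.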
Additionally, Ralph shares kash.py, a Python-based Kafka shell he built to handle historical data; discusses the issues the platform had to overcome to succeed; and explains how the migration from batch processing to stream processing can be made painless for the data science team.
EPISODE LINKS
Building Real-Time Data Governance at Scale with Apache Kafka ft. Tushar Thole
Handling 2 Million Apache Kafka Messages Per Second at Honeycomb
Why Data Mesh? ft. Ben Stopford
Serverless Stream Processing with Apache Kafka ft. Bill Bejeck
The Evolution of Apache Kafka: From In-House Infrastructure to Managed Cloud Service ft. Jay Kreps
What’s Next for the Streaming Audio Podcast ft. Kris Jenkins
On to the Next Chapter ft. Tim Berglund
Intro to Event Sourcing with Apache Kafka ft. Anna McDonald
Expanding Apache Kafka Multi-Tenancy for Cloud-Native Systems ft. Anna Povzner and Anastasia Vela
Apache Kafka 3.1 - Overview of Latest Features, Updates, and KIPs
Optimizing Cloud-Native Apache Kafka Performance ft. Alok Nikhil and Adithya Chandra
From Batch to Real-Time: Tips for Streaming Data Pipelines with Apache Kafka ft. Danica Fine
Real-Time Change Data Capture and Data Integration with Apache Kafka and Qlik
Modernizing Banking Architectures with Apache Kafka ft. Fotios Filacouris
Running Hundreds of Stream Processing Applications with Apache Kafka at Wise
Lessons Learned From Designing Serverless Apache Kafka ft. Prachetaa Raghavan
Using Apache Kafka as Cloud-Native Data System ft. Gwen Shapira
ksqlDB Fundamentals: How Apache Kafka, SQL, and ksqlDB Work Together ft. Simon Aubury
Explaining Stream Processing and Apache Kafka ft. Eugene Meidinger
Handling Message Errors and Dead Letter Queues in Apache Kafka ft. Jason Bell