Is it possible to build a real-time data platform without using stateful stream processing? Forecasty.ai is an artificial intelligence platform for forecasting commodity prices, giving users insight into the future valuation of raw materials. Nearly all AI models are batch-trained once, but commodities are tied to ever-fluctuating global financial markets, which demand real-time insights. In this episode, Ralph Debusmann (CTO, Forecasty.ai) shares their journey of migrating from a batch machine learning platform to a real-time event streaming system with Apache Kafka® and delves into their approach to making the transition frictionless.
Ralph explains that Forecasty.ai was initially built on top of batch processing; however, updating the models with batch-data syncs was costly and environmentally taxing. There was also the question of scalability—progressing from the 60 commodities on offer to their eventual goal of over 200. Ralph observed that most real-time systems are streaming-based data platforms built on stateful stream processing, using Kafka Streams, Apache Flink®, or even Apache Samza. However, stateful stream processing demands dedicated resources, such as a team of stream processing specialists, to get the job done.
With his existing team, Ralph decided to build a real-time data platform without any stateful stream processing. They stick strictly to out-of-the-box components—Kafka topics, the Kafka Producer API, the Kafka Consumer API, and Kafka connectors—along with a real-time database that ingests the data streams and implements the necessary joins inside the database itself.
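The pattern described above can be sketched in a few lines: plain consumers write each stream's events into database tables, and the join that a stream-processing engine would otherwise perform happens in SQL. This is an illustrative sketch only, not Forecasty.ai's actual code; it uses SQLite and hard-coded stand-in messages in place of real Kafka consumers and a real-time analytics database, and the topic and column names are invented for the example.

```python
import sqlite3

# Stand-ins for events consumed from two hypothetical Kafka topics
# ("prices" and "forecasts"); a real system would read these with the
# Kafka Consumer API and write them into a real-time database.
prices = [
    {"commodity": "copper", "ts": "2022-01-03", "price": 9720.5},
    {"commodity": "nickel", "ts": "2022-01-03", "price": 20925.0},
]
forecasts = [
    {"commodity": "copper", "ts": "2022-01-03", "forecast": 9850.0},
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prices (commodity TEXT, ts TEXT, price REAL)")
db.execute("CREATE TABLE forecasts (commodity TEXT, ts TEXT, forecast REAL)")

# Each "consumer" simply appends events to its table -- no local state
# stores, no stream-processing topology to build or operate.
db.executemany("INSERT INTO prices VALUES (:commodity, :ts, :price)", prices)
db.executemany("INSERT INTO forecasts VALUES (:commodity, :ts, :forecast)", forecasts)

# The join a Kafka Streams or Flink job would perform is plain SQL here.
rows = db.execute(
    """
    SELECT p.commodity, p.ts, p.price, f.forecast
    FROM prices p
    JOIN forecasts f ON p.commodity = f.commodity AND p.ts = f.ts
    """
).fetchall()
print(rows)  # [('copper', '2022-01-03', 9720.5, 9850.0)]
```

The trade-off is that the database, rather than a stream processor, becomes responsible for join latency and state growth, which is why this approach pairs naturally with a database built for real-time analytics.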
Additionally, Ralph shares the tool he built to handle historical data, kash.py, a Kafka shell based on Python; discusses the issues the platform had to overcome to succeed; and explains how they made the migration from batch processing to stream processing painless for the data science team.
EPISODE LINKS
Confluent Platform 7.0: New Features + Updates
Real-Time Stream Processing with Kafka Streams ft. Bill Bejeck
Automating Infrastructure as Code with Apache Kafka and Confluent ft. Rosemary Wang
Getting Started with Spring for Apache Kafka ft. Viktor Gamov
Powering Event-Driven Architectures on Microsoft Azure with Confluent
Automating DevOps for Apache Kafka and Confluent ft. Pere Urbón-Bayes
Intro to Kafka Connect: Core Components and Architecture ft. Robin Moffatt
Designing a Cluster Rollout Management System for Apache Kafka ft. Twesha Modi
Apache Kafka 3.0 - Improving KRaft and an Overview of New Features
How to Build a Strong Developer Community with Global Engagement ft. Robin Moffatt and Ale Murray
What Is Data Mesh, and How Does it Work? ft. Zhamak Dehghani
Multi-Cluster Apache Kafka with Cluster Linking ft. Nikhil Bhatia
Using Apache Kafka and ksqlDB for Data Replication at Bolt
Placing Apache Kafka at the Heart of a Data Revolution at Saxo Bank
Advanced Stream Processing with ksqlDB ft. Michael Drogalis
Minimizing Software Speciation with ksqlDB and Kafka Streams ft. Mitch Seymour
Collecting Data with a Custom SIEM System Built on Apache Kafka and Kafka Connect ft. Vitalii Rudenskyi
Consistent, Complete Distributed Stream Processing ft. Guozhang Wang
Powering Real-Time Analytics with Apache Kafka and Rockset
Automated Event-Driven Architectures and Microservices with Apache Kafka and SmartBear