A Real-World Example of a Real-Time Analytics Pipeline Using CockroachDB, Kafka and Apache Pinot/Startree.ai

Real-time analytics are becoming increasingly important for line of business transactional use cases such as fraud detection, real-time personalisation, and real-time inventory management. These use cases require the ability to process and analyse large volumes of data in real-time to make quick, informed decisions.

To meet this need, many organisations are turning to technologies such as CockroachDB, Kafka, and Apache Pinot. CockroachDB is a distributed SQL database that provides high availability and low latency for transactional workloads. Kafka is a distributed streaming platform that provides scalable, fault-tolerant data streaming. Apache Pinot is a real-time distributed analytics platform that provides low latency and high throughput for real-time analytics workloads.

In this blog post, we will explore how to combine CockroachDB Changefeeds, Kafka, and Pinot into a real-time analytics pipeline for line of business transactional use cases. We will discuss the benefits of each technology and walk through the steps for setting up and configuring the pipeline. By the end of this post, you should understand how these technologies fit together and how to build a real-time analytics pipeline for your own use case.

The need for real-time analytics

Many line of business transactional use cases require real-time analytics to enable quick, informed decision-making. For example, in fraud detection, it is essential to identify and respond to fraudulent transactions in real-time to minimise financial losses. In real-time personalisation, it is important to provide relevant and personalised content to customers in real-time to improve their user experience. In real-time inventory management, it is critical to monitor inventory levels in real-time to avoid stockouts and overstocking.

Traditional batch processing and analytics tools may not be able to meet the requirements of these use cases. Batch processing typically involves processing data in large batches, which can result in delays of hours or even days. This delay can be a significant disadvantage for line of business transactional use cases that require quick action.

Real-time analytics, on the other hand, enables organisations to process and analyse data in real-time, allowing for quick decision-making and action. Real-time analytics can also help organisations identify patterns and anomalies in real-time, enabling them to respond quickly to changing conditions.

CockroachDB and Changefeeds

CockroachDB is a distributed SQL database that provides high availability and low latency for transactional workloads. CockroachDB also provides Changefeeds, a feature that streams changes from CockroachDB to a sink. The sink can be cloud storage or a streaming platform such as Kafka or Redpanda.

Changefeeds work by pushing any changes made to a table to a sink. Applications can then consume the stream and process the changes in real-time. Changefeeds can be configured to filter changes by columns, tables, or other criteria, allowing for fine-grained control over the data being streamed.
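As a rough illustration of the filtering described above, the sketch below builds a CREATE CHANGEFEED statement in the query form (CREATE CHANGEFEED ... AS SELECT), which selects columns and rows at the source. The table name orders, its columns, the predicate, and the Kafka address are all hypothetical, and whether the query form is available depends on the CockroachDB version in use, so treat this as a starting point rather than a definitive statement.

```python
def build_filtered_changefeed(table, columns, predicate, kafka_uri):
    """Return a CREATE CHANGEFEED statement that streams only chosen columns and rows.

    The values are interpolated directly into the SQL text, so this helper is
    only suitable for trusted, hard-coded inputs such as the ones below.
    """
    projection = ", ".join(columns)
    return (
        f"CREATE CHANGEFEED INTO '{kafka_uri}' "
        f"AS SELECT {projection} FROM {table} WHERE {predicate};"
    )


# Hypothetical example: stream only large orders, and only three columns.
statement = build_filtered_changefeed(
    table="orders",
    columns=["id", "customer_id", "amount"],
    predicate="amount > 1000",
    kafka_uri="kafka://localhost:9092",
)
print(statement)
```

The resulting statement would be run once against the cluster from any SQL client; the changefeed then keeps running as a background job.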

The benefits of using Changefeeds for real-time analytics are numerous. First and foremost, Changefeeds provide low latency and high availability, making them ideal for real-time use cases. Additionally, Changefeeds can be used to capture changes to a table or database as they happen, ensuring that the data being analysed is always up-to-date. Finally, Changefeeds allow for fine-grained control over the data being streamed, enabling organisations to filter and process only the data that is relevant to their use case.

Kafka and CockroachDB Changefeeds

Kafka is a distributed streaming platform that provides scalable, fault-tolerant data streaming. Kafka is often used as a data pipeline to move data between systems and applications, making it an ideal choice for streaming data from CockroachDB Changefeeds.

CockroachDB does not need a separate connector to stream changes to Kafka: a changefeed acts as a built-in Kafka producer that connects directly to a Kafka broker. Organisations can then consume the data from Kafka and perform additional processing, such as data transformation or aggregation.
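To make the consuming side more concrete, here is a minimal sketch of decoding a single changefeed message after it has been read from Kafka. It assumes the JSON format with the default wrapped envelope, where the message key is a JSON array of primary key values, the row sits under "after", a deleted row appears as a null "after", and resolved-timestamp messages carry only a "resolved" field. The function name and the sample payloads are illustrative, and the Kafka client that would supply the raw bytes is left out.

```python
import json


def decode_changefeed_message(key_bytes, value_bytes):
    """Classify one changefeed message as an upsert, a delete, or a resolved marker."""
    value = json.loads(value_bytes)

    # Resolved-timestamp messages carry no row data, only a timestamp.
    if "resolved" in value:
        return {"type": "resolved", "timestamp": value["resolved"]}

    key = json.loads(key_bytes)  # JSON array of primary key values
    row = value.get("after")
    if row is None:
        # A null "after" means the row identified by the key was deleted.
        return {"type": "delete", "key": key}

    event = {"type": "upsert", "key": key, "row": row}
    if "updated" in value:
        # Present only when the changefeed was created with the updated option.
        event["updated"] = value["updated"]
    return event


# Hypothetical payloads for a table keyed by a single integer id.
upsert = decode_changefeed_message(
    b"[42]",
    b'{"after": {"id": 42, "amount": 99.5}, "updated": "1700000000000000000.0000000000"}',
)
delete = decode_changefeed_message(b"[42]", b'{"after": null}')
resolved = decode_changefeed_message(None, b'{"resolved": "1700000000000000000.0000000000"}')
```

Changefeeds deliver messages at least once, so a downstream consumer would typically treat upserts as idempotent writes keyed by the primary key.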

The benefits of using Kafka with CockroachDB Changefeeds are significant. First and foremost, Kafka provides a scalable, fault-tolerant data streaming platform that can handle high volumes of data. Additionally, Kafka provides a high degree of flexibility and can be used to move data between different systems and applications, making it an ideal choice for integrating with other data processing tools.

Apache Pinot

Apache Pinot is a real-time distributed analytics platform that provides low latency and high throughput for real-time analytics workloads. Pinot is designed to handle high volumes of data in real-time, making it an ideal choice for analysing data that is being streamed from Kafka.

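To use Pinot with Kafka, organisations can create a real-time table in Pinot and configure its stream ingestion to consume from a Kafka topic. Pinot servers then read messages from the topic and write them into the table's segments, so the data becomes queryable shortly after it arrives. Organisations can then use Pinot to query the data and perform real-time analytics.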
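In Pinot, Kafka ingestion is declared on a real-time table through its stream configuration. The sketch below builds such a table config as a Python dictionary and prints it as JSON. The table name, topic, broker address, and time column are hypothetical, and the key names and plugin class names follow the pattern used in Pinot's documentation examples; they differ between Pinot versions and should be checked against the release in use. It also assumes the Kafka messages have top-level JSON fields matching the table's columns, whereas the default changefeed envelope nests the row under "after", so a flattening or transform step would be needed in between.

```python
import json


def build_realtime_table_config(table_name, topic, brokers, time_column):
    """Return a Pinot REALTIME table config that consumes JSON messages from Kafka."""
    stream_configs = {
        "streamType": "kafka",
        "stream.kafka.topic.name": topic,
        "stream.kafka.broker.list": brokers,
        "stream.kafka.consumer.type": "lowlevel",
        # Plugin class names are assumptions; verify them for your Pinot version.
        "stream.kafka.consumer.factory.class.name":
            "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
        "stream.kafka.decoder.class.name":
            "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        # Start from the earliest available offset on first consumption.
        "stream.kafka.consumer.prop.auto.offset.reset": "smallest",
    }
    return {
        "tableName": table_name,
        "tableType": "REALTIME",
        "segmentsConfig": {
            "timeColumnName": time_column,
            "schemaName": table_name,
            "replication": "1",
        },
        "tenants": {},
        "tableIndexConfig": {"loadMode": "MMAP", "streamConfigs": stream_configs},
        "metadata": {},
    }


table_config = build_realtime_table_config(
    table_name="orders",
    topic="orders",
    brokers="localhost:9092",
    time_column="updated_ms",
)
print(json.dumps(table_config, indent=2))
```

The printed JSON could be saved to a file and submitted to the Pinot controller together with a matching schema.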

The benefits of using Pinot for real-time analytics are significant. Pinot provides low latency and high throughput, allowing organisations to perform real-time analytics on large volumes of data. Additionally, Pinot is highly scalable and fault-tolerant, making it an ideal choice for mission-critical applications.

Setting up the Real-time Analytics Pipeline

To set up the real-time analytics pipeline using CockroachDB Changefeeds, Kafka, and Pinot, organisations can follow these steps:

  1. Set up a CockroachDB cluster: Organisations can set up a CockroachDB cluster with at least three nodes for high availability and data redundancy.
  2. Enable Changefeeds: Changefeeds depend on rangefeeds, so organisations first enable the kv.rangefeed.enabled cluster setting on the cluster that holds the tables they want to stream from.
  3. Set up a Kafka or Redpanda cluster: Organisations can set up a cluster with at least three brokers for high availability and data redundancy.
  4. Configure CockroachDB Changefeeds: Organisations can create a changefeed on each table of interest, with the Kafka cluster as its sink, so that changes stream to Kafka topics.
  5. Connect Pinot to Kafka: Organisations can point Pinot's stream ingestion at the Kafka topics so that Pinot reads the data and writes it to Pinot tables.
  6. Configure Pinot: Organisations can configure Pinot to create the necessary tables and schemas for the data being streamed from Kafka.
  7. Start the real-time analytics pipeline: Once everything is set up and configured, organisations can start the real-time analytics pipeline by streaming data from CockroachDB Changefeeds to Kafka and then to Pinot.
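The configuration steps above can be sketched as follows. The first function returns the SQL that would be run against CockroachDB for steps 2 and 4, and the second builds a Pinot schema for step 6. The table name orders, its columns, the Kafka address, and the 10-second resolved interval are hypothetical choices made for illustration, not values taken from a real deployment.

```python
import json


def cockroach_setup_statements(table, kafka_uri):
    """Return the SQL for steps 2 and 4, in the order it would be executed."""
    return [
        # Step 2: changefeeds require rangefeeds to be enabled on the cluster.
        "SET CLUSTER SETTING kv.rangefeed.enabled = true;",
        # Step 4: stream every change on the table to Kafka, with timestamps.
        f"CREATE CHANGEFEED FOR TABLE {table} INTO '{kafka_uri}' "
        "WITH updated, resolved = '10s';",
    ]


def build_pinot_schema(name, dimensions, metrics, time_column):
    """Return a Pinot schema (step 6) from (column name, data type) pairs."""
    return {
        "schemaName": name,
        "dimensionFieldSpecs": [{"name": n, "dataType": t} for n, t in dimensions],
        "metricFieldSpecs": [{"name": n, "dataType": t} for n, t in metrics],
        "dateTimeFieldSpecs": [
            {
                "name": time_column,
                "dataType": "LONG",
                "format": "1:MILLISECONDS:EPOCH",
                "granularity": "1:MILLISECONDS",
            }
        ],
    }


statements = cockroach_setup_statements("orders", "kafka://localhost:9092")
schema = build_pinot_schema(
    name="orders",
    dimensions=[("id", "LONG"), ("customer_id", "STRING")],
    metrics=[("amount", "DOUBLE")],
    time_column="updated_ms",
)
for sql in statements:
    print(sql)
print(json.dumps(schema, indent=2))
```

Keeping the statements and the schema in code like this makes it easier to recreate the pipeline in a test environment, although the same SQL and JSON could equally be written by hand.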

The real-time analytics pipeline can then be used to perform real-time analytics on the data being streamed from CockroachDB. Organisations can use Pinot to perform queries and analytics on the data, providing quick insights for line of business transactional use cases.
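As an example of what such a query might look like, the sketch below prepares an HTTP request for a Pinot broker's SQL query endpoint and converts a response into a list of dictionaries. The broker address, the orders table, and its columns are assumptions, and the sample response is hand-written to mirror the shape of a broker response rather than captured from a real cluster. The request is only built here, not sent.

```python
import json
import urllib.request


def build_pinot_query_request(broker_url, sql):
    """Build a POST request for the Pinot broker's /query/sql endpoint."""
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{broker_url}/query/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rows_from_response(response_json):
    """Turn a broker response's resultTable into a list of {column: value} dicts."""
    table = response_json["resultTable"]
    columns = table["dataSchema"]["columnNames"]
    return [dict(zip(columns, row)) for row in table["rows"]]


# Hypothetical query: total order amount per customer, largest first.
request = build_pinot_query_request(
    "http://localhost:8099",
    "SELECT customer_id, SUM(amount) FROM orders "
    "GROUP BY customer_id ORDER BY SUM(amount) DESC LIMIT 10",
)

# Hand-written sample shaped like a broker response, for illustration only.
sample_response = {
    "resultTable": {
        "dataSchema": {
            "columnNames": ["customer_id", "sum(amount)"],
            "columnDataTypes": ["STRING", "DOUBLE"],
        },
        "rows": [["c-17", 5400.0], ["c-03", 1250.5]],
    }
}
top_customers = rows_from_response(sample_response)
```

Against a running broker, passing the request to urllib.request.urlopen and parsing the body with json.load would yield a response that can be fed to rows_from_response.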

Using Changefeeds from CockroachDB, Kafka, and Pinot can provide organisations with a real-time analytics pipeline that can meet the needs of line of business transactional use cases. The pipeline provides low latency, high availability, and high scalability, making it an ideal choice for mission-critical applications.

Benefits of Using the Real-time Analytics Pipeline

The real-time analytics pipeline using Changefeeds from CockroachDB, Kafka, and Pinot provides numerous benefits for line of business transactional use cases. Some of the benefits include:

  1. Real-time analytics: The pipeline provides real-time analytics capabilities, allowing organisations to quickly analyse data as it is being generated.
  2. Low latency: Pinot provides low latency queries, allowing organisations to quickly retrieve data and perform real-time analytics.
  3. High availability: The pipeline is designed to be highly available, with built-in redundancy and failover capabilities.
  4. Scalability: The pipeline is designed to be highly scalable, allowing organisations to process large volumes of data in real-time.
  5. Flexibility: Kafka provides a flexible data pipeline that can integrate with a wide range of systems and applications.
  6. Easy to set up and use: The pipeline can be set up and configured relatively easily, allowing organisations to quickly start streaming data and performing real-time analytics.
  7. Cost-effective: The pipeline can be cost-effective compared to traditional data warehousing solutions, as Kafka and Pinot are open-source and all three components can run on commodity hardware.

Overall, the real-time analytics pipeline using Changefeeds from CockroachDB, Kafka, and Pinot provides organisations with a powerful and flexible platform for real-time analytics. By leveraging this pipeline, organisations can gain insights into their data faster and make more informed decisions, ultimately leading to improved business outcomes.