CockroachDB and Analytic Workloads - Part 2

In an earlier post we discussed how CockroachDB can indeed be used for Analytics workloads. If you are interested in reading that article its available here

In this second post we will give a short refresher on why CockroachDB can be a good fit for analytics and then provide an example of a simple and a more complex query to prove how we can utilise the database to deliver on our analytics workloads.


Small Refresher


As you probably know by now, CockroachDB is a distributed SQL database that can offer you several features that can help with your analytics. Lets expand on some of them here;

  1. Horizontal scalability: CockroachDB can scale horizontally across multiple nodes and data centres, making it ideal for storing and processing large amounts of data. This allows analytics to be performed on vast amounts of data in parallel, reducing processing time and enabling faster decision making.
  2. Multi-active availability: CockroachDB can replicate data across multiple nodes in real-time, allowing for high availability and increased data resiliency. This ensures that data is always available for analysis, even in the event of node failures.
  3. SQL support: CockroachDB is a SQL database, which means that it can be easily integrated with popular data analysis tools such as Apache Spark, Tableau, Qlik Cloud, and Microsoft Power BI. This allows analysts to use familiar tools to analyse data stored in CockroachDB.
  4. Distributed transactions: CockroachDB supports distributed transactions, which allows for ACID compliance across multiple nodes. This ensures that data is always consistent and accurate, even during complex analytical operations.
  5. Geospatial data support: CockroachDB supports geospatial data types and indexing, making it well-suited for analysing location-based data. This can be useful for applications such as tracking customer movements, analysing delivery routes, and visualising real-time data on a map.

A simple example using Python and CockroachDB


Here is an example code snippet in Python for connecting to CockroachDB and performing some simple analytics using the Pandas library:

This code connects to a CockroachDB cluster using the psycopg2 library, performs a SQL query to retrieve data from a table, loads the results into a Pandas dataframe, and performs some simple analytics on the dataframe.

The mean, max, and min values of a specific column are calculated and printed to the console. Finally, the database connection is closed.

This is just a simple example, but CockroachDB's support for SQL and distributed transactions enables much more complex analytics to be performed.

As mentioned above we will now provide a more complex example using some commonly used analytics functions


You want more complex?


In this example analytical query that we utilise CockroachDB's support for window functions and aggregation:

This query first selects all orders that were placed within the last 30 days using the NOW() function and the INTERVAL keyword. It then groups the orders by customer ID and calculates the total spending and average spending per order using the SUM() and AVG() aggregation functions.

The query also uses a window function, ROW_NUMBER() OVER (ORDER BY SUM(order_amount) DESC), to assign a rank to each customer based on their total spending, with the highest spender receiving a rank of 1. Finally, the outer query selects only the top 5 customers based on their rank, and sorts them by total spending in descending order.

This query is just an example of the kinds of complex analytical queries that can be performed using CockroachDB's support for window functions and aggregation, and it can be easily customised to fit a wide range of analytical use cases.


Thanks for reading and look out soon for the next post on how CockroachDB can help you with your analytics workloads