Recursive Queries in Polars: A Modern Approach to Hierarchical Data Analysis

Introduction to Recursive Queries in Polars

As data engineers and analysts, we often encounter complex hierarchical structures in our data. Oracle’s hierarchical queries are a great example of this. However, when working with Polars, a modern open-source DataFrame library, we need to rewrite these queries to accommodate its different architecture.

In this article, we will explore how to rewrite Oracle’s hierarchical query using Polars. We’ll cover the basics of recursive queries in Polars and provide an example implementation.

Understanding Recursive Queries

Recursive queries are used to traverse hierarchical structures in data. They allow us to process each level of the hierarchy, from the topmost level (the root) down to the lowest level (the leaves). In the context of Oracle’s hierarchical query, we’re dealing with a tree-like structure where each node has a parent-child relationship.

What is Polars?

Polars is an open-source DataFrame library developed by the Microsoft AI Research Group. It provides a modern and efficient way to work with structured data in Python. Polars offers various features that simplify data analysis and manipulation, including:

  • Vectorized Operations: Perform operations on entire vectors of data at once.
  • Lazy Evaluation: Evaluate expressions only when needed.
  • Memory Efficiency: Efficient memory usage for large datasets.

Converting Oracle’s Hierarchical Query to Polars

To convert Oracle’s hierarchical query to Polars, we’ll break down the process into smaller steps. We’ll start by defining our data structures and then create a recursive function that traverses the hierarchy.

Step 1: Define Data Structures

We have two main data structures:

  • prod: A DataFrame containing the product IDs and their corresponding codes.
  • prre: A DataFrame containing the detail product IDs, master product IDs, and their relationships.
## Importing Libraries
import polars as pl

## Defining Data Structures
prod = pl.DataFrame(
    {'id': [10,11,12,13,14,15,16],
     'code': ['1008','1582','1583','2023','2025','2030','2222']}
)

prre = pl.DataFrame(
    {'detail_product_id': [10,12,91,14,90,11,15,13,94],
     'master_product_id': [90,11,92,12,93,91,12,12,95]}
)

Step 2: Creating a Recursive Function

We’ll create a recursive function that takes the prod DataFrame and the prre DataFrame as input. The function will iterate over each row in prod, filter for rows with specific codes, and then join with prre to get the next level of hierarchy.

## Creating Recursive Function
def recursive_query(prod_df, prre_df):
    # Filter for initial code values
    df = (
        prod_df
        .filter(pl.col.code.is_in(['1008', '1582']))
        .with_columns(group_type = pl.col.code)
    )
    
    dfs = [df]
    
    while True:
        # Join with prre to get next level of hierarchy
        if len(dfs[-1].height) == 0:
            break
        
        df = (
            prod_df
            .join(df, left_on="id", right_on="detail_product_id")
            .select("id", "code", "group_type")
        )
        
        dfs.append(df)
    
    # Concatenate all dataframes and drop id column
    result_df = pl.concat(dfs).drop("id")
    
    return result_df

# Call the recursive function
result_df = recursive_query(prod, prre)

print(result_df)

Alternative Approach using DuckDB Integration with Polars

We can also use DuckDB’s integration with Polars to achieve a similar result. Instead of creating a custom recursive function, we can leverage DuckDB’s built-in support for recursive CTEs.

## Using DuckDB Integration with Polars
import duckdb

# Create a DuckDB connection
conn = duckdb.connect()

# Create a query that uses a recursive CTE
query = """
    with recursive cte as (
        select code, code as group_type, id, 1 as depth
        from prod
        where code in ('1008', '1582')

        union all

        select d.code, c.group_type, d.id, c.depth + 1
        from prre as p
            inner join prod as d on
                d.id = p.master_product_id
            inner join cte as c on
                c.id = p.detail_product_id
    )
    select code, group_type, depth
    from cte
"""

# Execute the query using DuckDB
result_df = conn.query(query)

print(result_df)

Conclusion

Recursive queries in Polars can be used to traverse hierarchical structures in data. By leveraging vectorized operations and lazy evaluation, we can perform efficient data analysis and manipulation.

In this article, we explored how to convert Oracle’s hierarchical query using Polars. We covered the basics of recursive queries, defined our data structures, created a custom recursive function, and provided an alternative approach using DuckDB integration with Polars.


Last modified on 2024-09-12