Introduction to Recursive Queries in Polars
As data engineers and analysts, we often encounter complex hierarchical structures in our data. Oracle’s hierarchical queries are a great example of this. However, when working with Polars, a modern open-source DataFrame library, we need to rewrite these queries to accommodate its different architecture.
In this article, we will explore how to rewrite Oracle’s hierarchical query using Polars. We’ll cover the basics of recursive queries in Polars and provide an example implementation.
Understanding Recursive Queries
Recursive queries are used to traverse hierarchical structures in data. They allow us to process each level of the hierarchy, from the topmost level (the root) down to the lowest level (the leaves). In the context of Oracle’s hierarchical query, we’re dealing with a tree-like structure where each node has a parent-child relationship.
What is Polars?
Polars is an open-source DataFrame library developed by the Microsoft AI Research Group. It provides a modern and efficient way to work with structured data in Python. Polars offers various features that simplify data analysis and manipulation, including:
- Vectorized Operations: Perform operations on entire vectors of data at once.
- Lazy Evaluation: Evaluate expressions only when needed.
- Memory Efficiency: Efficient memory usage for large datasets.
Converting Oracle’s Hierarchical Query to Polars
To convert Oracle’s hierarchical query to Polars, we’ll break down the process into smaller steps. We’ll start by defining our data structures and then create a recursive function that traverses the hierarchy.
Step 1: Define Data Structures
We have two main data structures:
prod: A DataFrame containing the product IDs and their corresponding codes.prre: A DataFrame containing the detail product IDs, master product IDs, and their relationships.
## Importing Libraries
import polars as pl
## Defining Data Structures
prod = pl.DataFrame(
{'id': [10,11,12,13,14,15,16],
'code': ['1008','1582','1583','2023','2025','2030','2222']}
)
prre = pl.DataFrame(
{'detail_product_id': [10,12,91,14,90,11,15,13,94],
'master_product_id': [90,11,92,12,93,91,12,12,95]}
)
Step 2: Creating a Recursive Function
We’ll create a recursive function that takes the prod DataFrame and the prre DataFrame as input. The function will iterate over each row in prod, filter for rows with specific codes, and then join with prre to get the next level of hierarchy.
## Creating Recursive Function
def recursive_query(prod_df, prre_df):
# Filter for initial code values
df = (
prod_df
.filter(pl.col.code.is_in(['1008', '1582']))
.with_columns(group_type = pl.col.code)
)
dfs = [df]
while True:
# Join with prre to get next level of hierarchy
if len(dfs[-1].height) == 0:
break
df = (
prod_df
.join(df, left_on="id", right_on="detail_product_id")
.select("id", "code", "group_type")
)
dfs.append(df)
# Concatenate all dataframes and drop id column
result_df = pl.concat(dfs).drop("id")
return result_df
# Call the recursive function
result_df = recursive_query(prod, prre)
print(result_df)
Alternative Approach using DuckDB Integration with Polars
We can also use DuckDB’s integration with Polars to achieve a similar result. Instead of creating a custom recursive function, we can leverage DuckDB’s built-in support for recursive CTEs.
## Using DuckDB Integration with Polars
import duckdb
# Create a DuckDB connection
conn = duckdb.connect()
# Create a query that uses a recursive CTE
query = """
with recursive cte as (
select code, code as group_type, id, 1 as depth
from prod
where code in ('1008', '1582')
union all
select d.code, c.group_type, d.id, c.depth + 1
from prre as p
inner join prod as d on
d.id = p.master_product_id
inner join cte as c on
c.id = p.detail_product_id
)
select code, group_type, depth
from cte
"""
# Execute the query using DuckDB
result_df = conn.query(query)
print(result_df)
Conclusion
Recursive queries in Polars can be used to traverse hierarchical structures in data. By leveraging vectorized operations and lazy evaluation, we can perform efficient data analysis and manipulation.
In this article, we explored how to convert Oracle’s hierarchical query using Polars. We covered the basics of recursive queries, defined our data structures, created a custom recursive function, and provided an alternative approach using DuckDB integration with Polars.
Last modified on 2024-09-12