Execution Engine Configuration

DataStore can execute operations using different backends. This guide explains how to configure and optimize engine selection.

Available Engines

Engine	Description	Best For
`auto`	Automatically selects best engine per operation	General use (default)
`chdb`	Forces all operations through ClickHouse SQL	Large datasets, aggregations
`pandas`	Forces all operations through pandas	Compatibility testing, pandas-specific features

Setting the Engine

Global Configuration

from chdb.datastore.config import config

# Option 1: Using set method
config.set_execution_engine('auto')    # Default
config.set_execution_engine('chdb')    # Force ClickHouse
config.set_execution_engine('pandas')  # Force pandas

# Option 2: Using shortcuts
config.use_auto()     # Auto-select
config.use_chdb()     # Force ClickHouse
config.use_pandas()   # Force pandas

Checking Current Engine

print(config.execution_engine)  # 'auto', 'chdb', or 'pandas'

Auto Mode

In auto mode (default), DataStore selects the optimal engine for each operation:

Operations Executed in chDB

SQL-compatible filtering (filter(), where())
Column selection (select())
Sorting (sort(), orderby())
Grouping and aggregation (groupby().agg())
Joins (join(), merge())
Distinct (distinct(), drop_duplicates())
Limiting (limit(), head(), tail())

Operations Executed in pandas

Custom apply functions (apply(custom_func))
Complex pivot tables with custom aggregations
Operations not expressible in SQL
When input is already a pandas DataFrame

Example

from chdb import datastore as pd
from chdb.datastore.config import config

config.use_auto()  # Default

ds = pd.read_csv("data.csv")

# This uses chDB (SQL)
result = (ds
    .filter(ds['amount'] > 100)   # SQL: WHERE
    .groupby('region')            # SQL: GROUP BY
    .agg({'amount': 'sum'})       # SQL: SUM()
)

# This uses pandas (custom function)
result = ds.apply(lambda row: complex_calculation(row), axis=1)

chDB Mode

Force all operations through ClickHouse SQL:

config.use_chdb()

When to Use

Processing large datasets (millions of rows)
Heavy aggregation workloads
When you want maximum SQL optimization
Consistent behavior across all operations

Performance Characteristics

Operation Type	Performance
GroupBy/Aggregation	Excellent (up to 20x faster)
Complex Filtering	Excellent
Sorting	Very Good
Simple Single Filters	Good (slight overhead)

Limitations

Custom Python functions may not be supported
Some pandas-specific features require conversion

pandas Mode

Force all operations through pandas:

config.use_pandas()

When to Use

Compatibility testing with pandas
Using pandas-specific features
Debugging pandas-related issues
When data is already in pandas format

Performance Characteristics

Operation Type	Performance
Simple Single Operations	Good
Custom Functions	Excellent
Complex Aggregations	Slower than chDB
Large Datasets	Memory intensive

Cross-DataStore Engine

Configure the engine for operations that combine columns from different DataStores:

# Set cross-DataStore engine
config.set_cross_datastore_engine('auto')
config.set_cross_datastore_engine('chdb')
config.set_cross_datastore_engine('pandas')

Example

ds1 = pd.read_csv("sales.csv")
ds2 = pd.read_csv("inventory.csv")

# This operation involves two DataStores
result = ds1.join(ds2, on='product_id')
# Uses cross_datastore_engine setting

Engine Selection Logic

Auto Mode Decision Tree

Operation requested
    │
    ├─ Can be expressed in SQL?
    │      │
    │      ├─ Yes → Use chDB
    │      │
    │      └─ No → Use pandas
    │
    └─ Cross-DataStore operation?
           │
           └─ Use cross_datastore_engine setting

Function-Level Override

Some functions can have their engine explicitly configured:

from chdb.datastore.config import function_config

# Force specific functions to use specific engine
function_config.use_chdb('length', 'substring')
function_config.use_pandas('upper', 'lower')

See Function Config for details.

Performance Comparison

Benchmark results on 10M rows:

Operation	pandas (ms)	chdb (ms)	Speedup
GroupBy count	347	17	19.93x
Combined ops	1,535	234	6.56x
Complex pipeline	2,047	380	5.39x
Filter+Sort+Head	1,537	350	4.40x
GroupBy agg	406	141	2.88x
Single filter	276	526	0.52x

Key insights:

chDB excels at aggregations and complex pipelines
pandas is slightly faster for simple single operations
Use auto mode to get the best of both

Best Practices

1. Start with Auto Mode

config.use_auto()  # Let DataStore decide

2. Profile Before Forcing

config.enable_profiling()
# Run your workload
# Check profiler report to see where time is spent

3. Force Engine for Specific Workloads

# For heavy aggregation workloads
config.use_chdb()

# For pandas compatibility testing
config.use_pandas()

4. Use explain() to Understand Execution

ds = pd.read_csv("data.csv")
query = ds.filter(ds['age'] > 25).groupby('city').agg({'salary': 'sum'})

# See what SQL will be generated
query.explain()

Troubleshooting

Issue: Operation slower than expected

# Check current engine
print(config.execution_engine)

# Enable debug to see what's happening
config.enable_debug()

# Try forcing specific engine
config.use_chdb()  # or config.use_pandas()

Issue: Unsupported operation in chdb mode

# Some pandas operations aren't supported in SQL
# Solution: use auto mode
config.use_auto()

# Or explicitly convert to pandas first
df = ds.to_df()
result = df.some_pandas_specific_operation()

Issue: Memory issues with large data

# Use chdb engine to avoid loading all data into memory
config.use_chdb()

# Filter early to reduce data size
result = ds.filter(ds['date'] >= '2024-01-01').to_df()

# For maximum throughput on large datasets, use performance mode
# which enables parallel Parquet reading and single-SQL aggregation
config.use_performance_mode()

Performance Mode

If you are running heavy aggregation workloads and don't need exact pandas output compatibility (row order, MultiIndex, dtype corrections), consider using Performance Mode. It automatically sets the engine to chdb and removes all pandas compatibility overhead.

Available Engines​

Setting the Engine​

Global Configuration​

Checking Current Engine​

Auto Mode​

Operations Executed in chDB​

Operations Executed in pandas​

Example​

chDB Mode​

When to Use​

Performance Characteristics​

Limitations​

pandas Mode​

When to Use​

Performance Characteristics​

Cross-DataStore Engine​

Example​

Engine Selection Logic​

Auto Mode Decision Tree​

Function-Level Override​

Performance Comparison​

Best Practices​

1. Start with Auto Mode​

2. Profile Before Forcing​

3. Force Engine for Specific Workloads​

4. Use explain() to Understand Execution​

Troubleshooting​

Issue: Operation slower than expected​

Issue: Unsupported operation in chdb mode​

Issue: Memory issues with large data​

Available Engines

Setting the Engine

Global Configuration

Checking Current Engine

Auto Mode

Operations Executed in chDB

Operations Executed in pandas

Example

chDB Mode

When to Use

Performance Characteristics

Limitations

pandas Mode

When to Use

Performance Characteristics

Cross-DataStore Engine

Example

Engine Selection Logic

Auto Mode Decision Tree

Function-Level Override

Performance Comparison

Best Practices

1. Start with Auto Mode

2. Profile Before Forcing

3. Force Engine for Specific Workloads

4. Use explain() to Understand Execution

Troubleshooting

Issue: Operation slower than expected

Issue: Unsupported operation in chdb mode

Issue: Memory issues with large data