Execution Engine Configuration
DataStore can execute operations using different backends. This guide explains how to configure and optimize engine selection.
Available Engines
| Engine | Description | Best For |
|---|---|---|
| auto | Automatically selects best engine per operation | General use (default) |
| chdb | Forces all operations through ClickHouse SQL | Large datasets, aggregations |
| pandas | Forces all operations through pandas | Compatibility testing, pandas-specific features |
Setting the Engine
Global Configuration
Checking Current Engine
Auto Mode
In auto mode (default), DataStore selects the optimal engine for each operation:
Operations Executed in chDB
- SQL-compatible filtering (filter(), where())
- Column selection (select())
- Sorting (sort(), orderby())
- Grouping and aggregation (groupby().agg())
- Joins (join(), merge())
- Distinct (distinct(), drop_duplicates())
- Limiting (limit(), head(), tail())
Operations Executed in pandas
- Custom apply functions (apply(custom_func))
- Complex pivot tables with custom aggregations
- Operations not expressible in SQL
- When input is already a pandas DataFrame
Example
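The routing rules listed above can be modelled in a few lines of plain Python. This is only a sketch of the idea, not DataStore's actual implementation: `CHDB_OPERATIONS`, `pick_engine`, and operation labels such as `groupby_agg` are hypothetical names chosen for illustration.

```python
# Illustrative model only: this set mirrors the "Operations Executed in chDB"
# list and is NOT DataStore's internal routing table.
CHDB_OPERATIONS = {
    "filter", "where", "select", "sort", "orderby", "groupby_agg",
    "join", "merge", "distinct", "drop_duplicates", "limit", "head", "tail",
}


def pick_engine(operation: str, input_is_dataframe: bool = False) -> str:
    """Return "chdb" or "pandas" for a single operation under auto mode."""
    # Input that is already a pandas DataFrame stays in pandas.
    if input_is_dataframe:
        return "pandas"
    # SQL-expressible operations are routed to chDB.
    if operation in CHDB_OPERATIONS:
        return "chdb"
    # Everything else (custom apply, non-SQL operations) falls back to pandas.
    return "pandas"


# Hypothetical pipeline: each step is routed independently.
pipeline = ["filter", "groupby_agg", "apply", "head"]
plan = [(op, pick_engine(op)) for op in pipeline]
for op, engine in plan:
    print(f"{op:12s} -> {engine}")
```

The point of the sketch is that auto mode decides per operation, so a single pipeline can mix both engines.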
chDB Mode
Force all operations through ClickHouse SQL:
When to Use
- Processing large datasets (millions of rows)
- Heavy aggregation workloads
- When you want maximum SQL optimization
- Consistent behavior across all operations
Performance Characteristics
| Operation Type | Performance |
|---|---|
| GroupBy/Aggregation | Excellent (up to 20x faster) |
| Complex Filtering | Excellent |
| Sorting | Very Good |
| Simple Single Filters | Good (slight overhead) |
Limitations
- Custom Python functions may not be supported
- Some pandas-specific features require conversion
pandas Mode
Force all operations through pandas:
When to Use
- Compatibility testing with pandas
- Using pandas-specific features
- Debugging pandas-related issues
- When data is already in pandas format
Performance Characteristics
| Operation Type | Performance |
|---|---|
| Simple Single Operations | Good |
| Custom Functions | Excellent |
| Complex Aggregations | Slower than chDB |
| Large Datasets | Memory intensive |
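The "When to Use" guidance for the chDB and pandas modes can be condensed into a small helper. This is a rough sketch under stated assumptions: `recommend_engine` and its parameters are hypothetical, and the one-million-row threshold is an assumed reading of "millions of rows", not a documented cutoff.

```python
LARGE_DATASET_ROWS = 1_000_000  # assumption: stand-in for "millions of rows"


def recommend_engine(
    rows: int,
    heavy_aggregation: bool = False,
    needs_custom_python: bool = False,
    needs_pandas_features: bool = False,
    data_already_in_pandas: bool = False,
) -> str:
    """Suggest "auto", "chdb", or "pandas" from the guidance in this guide."""
    # pandas mode: pandas-specific features or data already held in pandas.
    if needs_pandas_features or data_already_in_pandas:
        return "pandas"
    # Custom Python functions may not be supported in chDB mode,
    # so avoid forcing it and let auto mode route each operation.
    if needs_custom_python:
        return "auto"
    # chDB mode: large datasets and heavy aggregation workloads.
    if heavy_aggregation or rows >= LARGE_DATASET_ROWS:
        return "chdb"
    # Otherwise stay with the default.
    return "auto"


print(recommend_engine(rows=10_000_000, heavy_aggregation=True))
print(recommend_engine(rows=5_000, data_already_in_pandas=True))
print(recommend_engine(rows=5_000))
```

A helper like this is only a starting point; measured timings on the real workload should take precedence over any fixed threshold.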
Cross-DataStore Engine
Configure the engine for operations that combine columns from different DataStores:
Example
Engine Selection Logic
Auto Mode Decision Tree
Function-Level Override
Some functions can have their engine explicitly configured:
See Function Config for details.
Performance Comparison
Benchmark results on 10M rows:
| Operation | pandas (ms) | chdb (ms) | Speedup |
|---|---|---|---|
| GroupBy count | 347 | 17 | 19.93x |
| Combined ops | 1,535 | 234 | 6.56x |
| Complex pipeline | 2,047 | 380 | 5.39x |
| Filter+Sort+Head | 1,537 | 350 | 4.40x |
| GroupBy agg | 406 | 141 | 2.88x |
| Single filter | 276 | 526 | 0.52x |
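The Speedup column reads as pandas time divided by chdb time, so values above 1 favour chdb and values below 1 favour pandas. The sketch below recomputes that ratio from the rounded millisecond figures in the table; because the inputs are rounded, the recomputed values can differ slightly from the published column. The names `TIMINGS` and `speedup` are illustrative.

```python
# (operation, pandas ms, chdb ms) copied from the benchmark table above.
TIMINGS = [
    ("GroupBy count", 347, 17),
    ("Combined ops", 1535, 234),
    ("Complex pipeline", 2047, 380),
    ("Filter+Sort+Head", 1537, 350),
    ("GroupBy agg", 406, 141),
    ("Single filter", 276, 526),
]


def speedup(pandas_ms: float, chdb_ms: float) -> float:
    """Ratio of pandas time to chdb time; > 1 means chdb was faster."""
    return pandas_ms / chdb_ms


for name, pandas_ms, chdb_ms in TIMINGS:
    ratio = speedup(pandas_ms, chdb_ms)
    faster = "chdb" if ratio > 1 else "pandas"
    print(f"{name:18s} {ratio:5.2f}x  faster engine: {faster}")
```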
Key insights:
- chDB excels at aggregations and complex pipelines
- pandas is faster for simple single operations (about 1.9x on the single filter above)
- Use auto mode to get the best of both
Best Practices
1. Start with Auto Mode
2. Profile Before Forcing
3. Force Engine for Specific Workloads
4. Use explain() to Understand Execution
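For practice 2 above (profiling before forcing an engine), a minimal timing harness might look like the following. It is a generic sketch using only the standard library: `best_time_ms` is a hypothetical helper, and the two workloads are stand-ins. In real use, each callable would run the same DataStore pipeline under a different engine setting.

```python
import time


def best_time_ms(fn, repeats: int = 5) -> float:
    """Run fn several times and return the fastest wall-clock time in ms."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed_ms = (time.perf_counter() - start) * 1000
        best = min(best, elapsed_ms)
    return best


# Stand-in workloads (assumptions): replace these with the same pipeline
# executed once per engine you want to compare.
def workload_a():
    return sum(i * i for i in range(50_000))


def workload_b():
    return sorted(range(50_000), reverse=True)[:10]


results = {
    name: best_time_ms(fn)
    for name, fn in [("workload_a", workload_a), ("workload_b", workload_b)]
}
fastest = min(results, key=results.get)
for name, ms in results.items():
    print(f"{name}: {ms:.2f} ms")
print(f"fastest: {fastest}")
```

Taking the best of several runs reduces noise from caching and warm-up, which matters when the difference between engines is small.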
Troubleshooting
Issue: Operation slower than expected
Issue: Unsupported operation in chdb mode
Issue: Memory issues with large data
If you are running heavy aggregation workloads and don't need exact pandas output compatibility (row order, MultiIndex, dtype corrections), consider using Performance Mode. It automatically sets the engine to chdb and removes all pandas compatibility overhead.