DataStore delivers significant performance improvements over pandas for many operations. This guide explains why and how to optimize your workloads.
## Why DataStore Is Faster

### 1. SQL Pushdown

Operations are pushed down to the data source instead of running in Python:
```python
# pandas: loads ALL data into memory, then filters in Python
df = pd.read_csv("huge.csv")   # reads the full 10 GB file
df = df[df['year'] == 2024]    # filters in memory

# DataStore: same code via its pandas-compatible API (imported as pd),
# but the filter runs at the source
ds = pd.read_csv("huge.csv")   # builds a lazy plan; only metadata is read
ds = ds[ds['year'] == 2024]    # compiles to a SQL WHERE clause
df = ds.to_df()                # only the filtered rows are loaded
```
### 2. Column Pruning

Only the columns you reference are read:
```python
# DataStore: only the name and age columns are read
ds = pd.read_parquet("wide_table.parquet")
result = ds.select('name', 'age').to_df()

# vs pandas: reads all 100 columns, then selects two
```
### 3. Lazy Evaluation

Multiple chained operations compile into a single query:
```python
# DataStore: one optimized SQL query
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# Becomes:
# SELECT region, SUM(amount) FROM data
# WHERE amount > 100
# GROUP BY region ORDER BY sum DESC LIMIT 10
```
## Benchmark: DataStore vs pandas

### Test Environment
- Data: 10 million rows
- Hardware: Standard laptop
- File format: CSV
### Results
| Operation | pandas (ms) | DataStore (ms) | Winner |
|---|---|---|---|
| GroupBy count | 347 | 17 | DataStore (19.93x) |
| Combined ops | 1,535 | 234 | DataStore (6.56x) |
| Complex pipeline | 2,047 | 380 | DataStore (5.39x) |
| MultiFilter+Sort+Head | 1,963 | 366 | DataStore (5.36x) |
| Filter+Sort+Head | 1,537 | 350 | DataStore (4.40x) |
| Head/Limit | 166 | 45 | DataStore (3.69x) |
| Ultra-complex (10+ ops) | 1,070 | 338 | DataStore (3.17x) |
| GroupBy agg | 406 | 141 | DataStore (2.88x) |
| Select+Filter+Sort | 1,217 | 443 | DataStore (2.75x) |
| Filter+GroupBy+Sort | 466 | 184 | DataStore (2.53x) |
| Filter+Select+Sort | 1,285 | 533 | DataStore (2.41x) |
| Sort (single) | 1,742 | 1,197 | DataStore (1.45x) |
| Filter (single) | 276 | 526 | Comparable |
| Sort (multiple) | 947 | 1,477 | Comparable |
### Key Insights
- GroupBy operations: DataStore is up to 19.93x faster
- Complex pipelines: DataStore is 5-6x faster (SQL pushdown benefit)
- Simple slice operations: performance is comparable; the difference is negligible
- Best use case: multi-step operations with groupby/aggregation
- Zero-copy: `to_df()` has no data conversion overhead
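These numbers will vary with hardware and data shape. A minimal sketch for spot-checking the GroupBy comparison on synthetic data; the `from chdb.datastore import DataStore` import path is an assumption, and materializing the lazy result via `to_df()` follows the pipeline examples above:

```python
import time
import numpy as np
import pandas as pd
from chdb.datastore import DataStore  # import path assumed

# Synthetic 10-million-row table, matching the benchmark's scale
n = 10_000_000
df = pd.DataFrame({
    'category': np.random.choice(['a', 'b', 'c', 'd'], size=n),
    'amount': np.random.rand(n),
})

# pandas: in-memory groupby
t0 = time.perf_counter()
df.groupby('category')['amount'].sum()
print(f"pandas:    {(time.perf_counter() - t0) * 1000:.0f} ms")

# DataStore: zero-copy wrap, then the same aggregation as SQL
ds = DataStore(df)
t0 = time.perf_counter()
ds.groupby('category')['amount'].sum().to_df()
print(f"DataStore: {(time.perf_counter() - t0) * 1000:.0f} ms")
```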
## When DataStore Wins

### Heavy Aggregations
```python
# DataStore excels here: up to 19.93x faster
result = ds.groupby('category')['amount'].sum()
```
### Complex Pipelines
```python
# DataStore excels here: 5-6x faster
result = (ds
    .filter(ds['date'] >= '2024-01-01')
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': ['sum', 'mean', 'count']})
    .sort('sum', ascending=False)
    .head(20)
)
```
### Large File Processing
```python
# DataStore: only loads what you need
ds = pd.read_parquet("huge_file.parquet")
result = ds.filter(ds['id'] == 12345).to_df()  # fast: the filter is pushed down
```
### Multiple Column Operations
```python
# DataStore: combines these into a single SQL query
ds['total'] = ds['price'] * ds['quantity']
ds['is_large'] = ds['total'] > 1000
ds = ds.filter(ds['is_large'])
```
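Conceptually, those three steps fold into one query along these lines (illustrative; the exact SQL DataStore generates may differ):

```python
# Becomes (roughly):
# SELECT *,
#        price * quantity          AS total,
#        (price * quantity) > 1000 AS is_large
# FROM data
# WHERE (price * quantity) > 1000
```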
## When pandas Is Comparable
In most scenarios, DataStore matches or exceeds pandas performance. However, pandas may be slightly faster in these specific cases:
### Small Datasets (<1,000 rows)
```python
# For very small datasets, overhead is minimal for both;
# the performance difference is negligible
small_df = pd.DataFrame({'x': range(100)})
```
### Simple Slice Operations
```python
# A single slice operation without aggregation
df = df[df['x'] > 10]  # pandas: slightly faster here
ds = ds[ds['x'] > 10]  # DataStore: comparable
```
### Custom Python Lambda Functions
```python
# pandas is required for custom Python code
def complex_function(row):
    return custom_logic(row)

df['result'] = df.apply(complex_function, axis=1)
```
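A practical middle ground is to convert late: let DataStore do the SQL-friendly filtering and column pruning, then hand the much smaller result to pandas for the custom Python step. A minimal sketch, where `custom_logic` is the same placeholder as above and the column names are illustrative:

```python
# Push the cheap, SQL-friendly work down to DataStore first...
df = ds.filter(ds['year'] == 2024).select('id', 'amount').to_df()

# ...then apply the custom Python logic to the reduced DataFrame.
df['result'] = df.apply(lambda row: custom_logic(row), axis=1)
```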
**Important:** Even in scenarios where DataStore is "slower", performance is typically on par with pandas; the difference is negligible for practical use. DataStore's advantages in complex operations far outweigh these edge cases.
For fine-grained control over execution, see Execution Engine Configuration.
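For example, a minimal sketch; `'auto'` (the default) is the only engine value shown in this guide, so see that page for the full list:

```python
from chdb.datastore.config import config

# 'auto' (the default) lets DataStore pick the execution path per operation
config.set_execution_engine('auto')
```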
## Zero-Copy DataFrame Integration
DataStore uses zero-copy for reading and writing pandas DataFrames. This means:
```python
# to_df() does NOT copy data - it's a zero-copy operation
result = ds.filter(ds['x'] > 10).to_df()  # no data conversion overhead

# Same when creating a DataStore from a DataFrame
ds = DataStore(existing_df)  # no data copy
```
Key implications:

- `to_df()` is essentially free - no serialization or memory copying
- Creating a DataStore from a pandas DataFrame is instant
- Memory is shared between DataStore and pandas views
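You can sanity-check this on your own data. A minimal sketch (the `from chdb.datastore import DataStore` import path is an assumption; timings are illustrative):

```python
import time
import pandas as pd
from chdb.datastore import DataStore  # import path assumed

df = pd.DataFrame({'x': range(10_000_000)})

# Wrapping a DataFrame should be near-instant: no data is copied
t0 = time.perf_counter()
ds = DataStore(df)
print(f"wrap:  {(time.perf_counter() - t0) * 1000:.2f} ms")

# Only the query itself costs time; to_df() adds no conversion overhead
t0 = time.perf_counter()
out = ds.filter(ds['x'] > 10).to_df()
print(f"query: {(time.perf_counter() - t0) * 1000:.2f} ms")
```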
## Optimization Tips

### 1. Enable Performance Mode

For aggregation-heavy workloads where you don't need the exact pandas output format (row order, MultiIndex columns, dtype corrections), enable performance mode for maximum throughput:
```python
from chdb.datastore.config import config

config.use_performance_mode()

# Now all operations use SQL-first execution with no pandas overhead:
# - Parallel Parquet reading (no preserve_order)
# - Single-SQL aggregation (filter+groupby in one query)
# - No row-order preservation overhead
# - No MultiIndex, no dtype corrections
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': ['sum', 'mean', 'count']})
)
```
**Expected improvement:** up to 2-8x faster for filter+groupby workloads, plus reduced memory usage for large Parquet files.
See Performance Mode for full details.
### 2. Use Parquet Instead of CSV
```python
# CSV: slower; the entire file must be parsed
ds = pd.read_csv("data.csv")

# Parquet: faster - columnar and compressed
ds = pd.read_parquet("data.parquet")

# Convert once, benefit forever
df = pd.read_csv("data.csv")
df.to_parquet("data.parquet")
```
**Expected improvement:** 3-10x faster reads.
### 3. Filter Early
```python
# Good: filter first, then aggregate
result = (ds
    .filter(ds['date'] >= '2024-01-01')   # reduce data early
    .groupby('category')['amount'].sum()
)

# Less optimal: process all data, then filter
result = (ds
    .groupby('category')['amount'].sum()
    .filter(ds['sum'] > 1000)             # filter arrives too late to prune rows
)
```
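The shape of the generated SQL shows why, following the comment convention of the lazy-evaluation example above (illustrative; not the exact query DataStore emits):

```python
# Filter early: the predicate compiles to a WHERE clause, pruning rows
# before the aggregation runs:
#   SELECT category, SUM(amount) FROM data
#   WHERE date >= '2024-01-01'
#   GROUP BY category
#
# Filter late: every row is aggregated first and whole groups are
# discarded afterwards (a HAVING clause), which is strictly more work:
#   SELECT category, SUM(amount) AS sum FROM data
#   GROUP BY category
#   HAVING sum > 1000
```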
### 4. Select Only Needed Columns
```python
# Good: column pruning - only two columns are read
result = ds.select('name', 'amount').filter(ds['amount'] > 100)

# Less optimal: all columns are loaded
result = ds.filter(ds['amount'] > 100)
```
### 5. Leverage SQL Aggregations
```python
# GroupBy is where DataStore shines: up to 20x speedup
result = ds.groupby('category').agg({
    'amount': ['sum', 'mean', 'count', 'max'],
    'quantity': 'sum'
})
```
### 6. Use head() Instead of Full Queries
```python
# Don't load the entire result if you only need a sample
result = ds.filter(ds['type'] == 'A').head(100)  # compiles to LIMIT 100

# Avoid this for large results:
# result = ds.filter(ds['type'] == 'A').to_df()  # loads everything
```
### 7. Batch Operations
```python
# Good: chained filters compile into a single execution
result = ds.filter(ds['x'] > 10).filter(ds['y'] < 100).to_df()

# Bad: materialize early, then filter again in pandas
result1 = ds.filter(ds['x'] > 10).to_df()  # executes and loads all matching rows
result2 = result1[result1['y'] < 100]      # second pass, now in memory
```
### 8. Use explain() to Optimize
```python
# View the query plan before executing
query = ds.filter(...).groupby(...).agg(...)
query.explain()  # check whether operations are pushed down

# Then execute
result = query.to_df()
```
## Profiling Your Workload

### Enable Profiling
```python
from chdb.datastore.config import config, get_profiler

config.enable_profiling()

# Run your workload
result = your_pipeline()

# View the report
profiler = get_profiler()
profiler.report()
```
### Identify Bottlenecks
```text
Performance Report
==================
Step             Duration    % Total
----             --------    -------
SQL execution    2.5s        62.5%    <- Bottleneck!
read_csv         1.2s        30.0%
Other            0.3s         7.5%
```
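To find the slowest step programmatically, a minimal sketch; `get_steps()` and its `'duration_ms'` field are used in the comparison below, but the rest of each entry's schema is an assumption worth inspecting:

```python
from chdb.datastore.config import get_profiler

profiler = get_profiler()
steps = profiler.get_steps()

# 'duration_ms' is the field used elsewhere in this guide; print one
# entry (print(steps[0])) to see what else each step record contains.
slowest = max(steps, key=lambda s: s['duration_ms'])
print(f"Slowest step: {slowest}")
```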
### Compare Approaches
```python
# Test approach 1
profiler.reset()
result1 = approach1()
time1 = profiler.get_steps()[-1]['duration_ms']

# Test approach 2
profiler.reset()
result2 = approach2()
time2 = profiler.get_steps()[-1]['duration_ms']

print(f"Approach 1: {time1:.0f}ms")
print(f"Approach 2: {time2:.0f}ms")
```
## Best Practices Summary
| Practice | Impact |
|---|---|
| Enable performance mode | 2-8x faster for aggregation workloads |
| Use Parquet files | 3-10x faster reads |
| Filter early | Reduce data processing |
| Select needed columns | Reduce I/O and memory |
| Use GroupBy/aggregations | Up to 20x faster |
| Batch operations | Avoid repeated execution |
| Profile before optimizing | Find real bottlenecks |
| Use explain() | Verify query optimization |
| Use head() for samples | Avoid full table scans |
## Quick Decision Guide
| Your Workload | Recommendation |
|---|---|
| GroupBy/aggregation | Use DataStore |
| Complex multi-step pipeline | Use DataStore |
| Large files with filters | Use DataStore |
| Simple slice operations | Either (comparable performance) |
| Custom Python lambda functions | Use pandas or convert late |
| Very small data (<1,000 rows) | Either (negligible difference) |
**Tip**

- For automatic optimal engine selection, use `config.set_execution_engine('auto')` (the default).
- For maximum throughput on aggregation workloads, use `config.use_performance_mode()`.

See Execution Engine and Performance Mode for details.