DataStore delivers significant performance improvements over pandas for many operations. This guide explains why and how to optimize your workloads.
## Why DataStore Is Faster

### 1. SQL Pushdown

Operations are pushed down to the data source instead of running in Python:
```python
# pandas: loads ALL data into memory, then filters in Python
df = pd.read_csv("huge.csv")   # reads the full 10 GB file
df = df[df['year'] == 2024]    # filters in memory

# DataStore: same code via its pandas-compatible API (imported as pd),
# but the filter runs at the source
ds = pd.read_csv("huge.csv")   # builds a lazy plan; only metadata is read
ds = ds[ds['year'] == 2024]    # compiles to a SQL WHERE clause
df = ds.to_df()                # only the filtered rows are loaded
```
### 2. Column Pruning

Only the columns you reference are read:
```python
# DataStore: only the name and age columns are read
ds = pd.read_parquet("wide_table.parquet")
result = ds.select('name', 'age').to_df()

# vs pandas: reads all 100 columns, then selects two
```
### 3. Lazy Evaluation

Multiple chained operations compile into a single query:
```python
# DataStore: one optimized SQL query
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# Becomes:
# SELECT region, SUM(amount) FROM data
# WHERE amount > 100
# GROUP BY region ORDER BY sum DESC LIMIT 10
```
## Benchmark: DataStore vs pandas

### Test Environment
- Data: 10 million rows
- Hardware: Standard laptop
- File format: CSV
### Results
| Operation | pandas (ms) | DataStore (ms) | Winner |
|---|---|---|---|
| GroupBy count | 347 | 17 | DataStore (19.93x) |
| Combined ops | 1,535 | 234 | DataStore (6.56x) |
| Complex pipeline | 2,047 | 380 | DataStore (5.39x) |
| MultiFilter+Sort+Head | 1,963 | 366 | DataStore (5.36x) |
| Filter+Sort+Head | 1,537 | 350 | DataStore (4.40x) |
| Head/Limit | 166 | 45 | DataStore (3.69x) |
| Ultra-complex (10+ ops) | 1,070 | 338 | DataStore (3.17x) |
| GroupBy agg | 406 | 141 | DataStore (2.88x) |
| Select+Filter+Sort | 1,217 | 443 | DataStore (2.75x) |
| Filter+GroupBy+Sort | 466 | 184 | DataStore (2.53x) |
| Filter+Select+Sort | 1,285 | 533 | DataStore (2.41x) |
| Sort (single) | 1,742 | 1,197 | DataStore (1.45x) |
| Filter (single) | 276 | 526 | Comparable |
| Sort (multiple) | 947 | 1,477 | Comparable |
### Key Insights
- GroupBy operations: DataStore is up to 19.93x faster
- Complex pipelines: DataStore is 5-6x faster (SQL pushdown benefit)
- Simple slice operations: performance is comparable; the difference is negligible
- Best use case: multi-step operations with groupby/aggregation
- Zero-copy: `to_df()` has no data conversion overhead
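These numbers will vary with hardware and data shape. A minimal sketch for spot-checking the GroupBy comparison on synthetic data; the `from chdb.datastore import DataStore` import path is an assumption, and materializing the lazy result via `to_df()` follows the pipeline examples above:

```python
import time
import numpy as np
import pandas as pd
from chdb.datastore import DataStore  # import path assumed

# Synthetic 10-million-row table, matching the benchmark's scale
n = 10_000_000
df = pd.DataFrame({
    'category': np.random.choice(['a', 'b', 'c', 'd'], size=n),
    'amount': np.random.rand(n),
})

# pandas: in-memory groupby
t0 = time.perf_counter()
df.groupby('category')['amount'].sum()
print(f"pandas:    {(time.perf_counter() - t0) * 1000:.0f} ms")

# DataStore: zero-copy wrap, then the same aggregation as SQL
ds = DataStore(df)
t0 = time.perf_counter()
ds.groupby('category')['amount'].sum().to_df()
print(f"DataStore: {(time.perf_counter() - t0) * 1000:.0f} ms")
```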
## When DataStore Wins

### Heavy Aggregations
```python
# DataStore excels here: up to 19.93x faster
result = ds.groupby('category')['amount'].sum()
```
### Complex Pipelines
```python
# DataStore excels here: 5-6x faster
result = (ds
    .filter(ds['date'] >= '2024-01-01')
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': ['sum', 'mean', 'count']})
    .sort('sum', ascending=False)
    .head(20)
)
```
### Large File Processing
```python
# DataStore: only loads what you need
ds = pd.read_parquet("huge_file.parquet")
result = ds.filter(ds['id'] == 12345).to_df()  # fast: the filter is pushed down
```
### Multiple Column Operations
```python
# DataStore: combines these into a single SQL query
ds['total'] = ds['price'] * ds['quantity']
ds['is_large'] = ds['total'] > 1000
ds = ds.filter(ds['is_large'])
```
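Conceptually, those three steps fold into one query along these lines (illustrative; the exact SQL DataStore generates may differ):

```python
# Becomes (roughly):
# SELECT *,
#        price * quantity          AS total,
#        (price * quantity) > 1000 AS is_large
# FROM data
# WHERE (price * quantity) > 1000
```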
## When pandas Is Comparable
In most scenarios, DataStore matches or exceeds pandas performance. However, pandas may be slightly faster in these specific cases:
### Small Datasets (<1,000 rows)
```python
# For very small datasets, overhead is minimal for both;
# the performance difference is negligible
small_df = pd.DataFrame({'x': range(100)})
```
### Simple Slice Operations
```python
# A single slice operation without aggregation
df = df[df['x'] > 10]  # pandas: slightly faster here
ds = ds[ds['x'] > 10]  # DataStore: comparable
```
### Custom Python Lambda Functions
```python
# pandas is required for custom Python code
def complex_function(row):
    return custom_logic(row)

df['result'] = df.apply(complex_function, axis=1)
```
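A practical middle ground is to convert late: let DataStore do the SQL-friendly filtering and column pruning, then hand the much smaller result to pandas for the custom Python step. A minimal sketch, where `custom_logic` is the same placeholder as above and the column names are illustrative:

```python
# Push the cheap, SQL-friendly work down to DataStore first...
df = ds.filter(ds['year'] == 2024).select('id', 'amount').to_df()

# ...then apply the custom Python logic to the reduced DataFrame.
df['result'] = df.apply(lambda row: custom_logic(row), axis=1)
```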
**Important:** Even in scenarios where DataStore is "slower", performance is typically on par with pandas; the difference is negligible for practical use. DataStore's advantages in complex operations far outweigh these edge cases.
For fine-grained control over execution, see Execution Engine Configuration.
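For example, a minimal sketch; `'auto'` (the default) is the only engine value shown in this guide, so see that page for the full list:

```python
from chdb.datastore.config import config

# 'auto' (the default) lets DataStore pick the execution path per operation
config.set_execution_engine('auto')
```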
## Zero-Copy DataFrame Integration
DataStore uses zero-copy for reading and writing pandas DataFrames. This means:
```python
# to_df() does NOT copy data - it's a zero-copy operation
result = ds.filter(ds['x'] > 10).to_df()  # no data conversion overhead

# Same when creating a DataStore from a DataFrame
ds = DataStore(existing_df)  # no data copy
```
Key implications:

- `to_df()` is essentially free - no serialization or memory copying
- Creating a DataStore from a pandas DataFrame is instant
- Memory is shared between DataStore and pandas views
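You can sanity-check this on your own data. A minimal sketch (the `from chdb.datastore import DataStore` import path is an assumption; timings are illustrative):

```python
import time
import pandas as pd
from chdb.datastore import DataStore  # import path assumed

df = pd.DataFrame({'x': range(10_000_000)})

# Wrapping a DataFrame should be near-instant: no data is copied
t0 = time.perf_counter()
ds = DataStore(df)
print(f"wrap:  {(time.perf_counter() - t0) * 1000:.2f} ms")

# Only the query itself costs time; to_df() adds no conversion overhead
t0 = time.perf_counter()
out = ds.filter(ds['x'] > 10).to_df()
print(f"query: {(time.perf_counter() - t0) * 1000:.2f} ms")
```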
## Optimization Tips

### 1. Enable Performance Mode

For aggregation-heavy workloads where you don't need the exact pandas output format (row order, MultiIndex columns, dtype corrections), enable performance mode for maximum throughput:
```python
from chdb.datastore.config import config

config.use_performance_mode()

# Now all operations use SQL-first execution with no pandas overhead:
# - Parallel Parquet reading (no preserve_order)
# - Single-SQL aggregation (filter+groupby in one query)
# - No row-order preservation overhead
# - No MultiIndex, no dtype corrections
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': ['sum', 'mean', 'count']})
)
```
**Expected improvement:** up to 2-8x faster for filter+groupby workloads, plus reduced memory usage for large Parquet files.
See Performance Mode for full details.
### 2. Use Parquet Instead of CSV
```python
# CSV: slower; the entire file must be parsed
ds = pd.read_csv("data.csv")

# Parquet: faster - columnar and compressed
ds = pd.read_parquet("data.parquet")

# Convert once, benefit forever
df = pd.read_csv("data.csv")
df.to_parquet("data.parquet")
```
**Expected improvement:** 3-10x faster reads.
### 3. Filter Early
```python
# Good: filter first, then aggregate
result = (ds
    .filter(ds['date'] >= '2024-01-01')   # reduce data early
    .groupby('category')['amount'].sum()
)

# Less optimal: process all data, then filter
result = (ds
    .groupby('category')['amount'].sum()
    .filter(ds['sum'] > 1000)             # filter arrives too late to prune rows
)
```
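The shape of the generated SQL shows why, following the comment convention of the lazy-evaluation example above (illustrative; not the exact query DataStore emits):

```python
# Filter early: the predicate compiles to a WHERE clause, pruning rows
# before the aggregation runs:
#   SELECT category, SUM(amount) FROM data
#   WHERE date >= '2024-01-01'
#   GROUP BY category
#
# Filter late: every row is aggregated first and whole groups are
# discarded afterwards (a HAVING clause), which is strictly more work:
#   SELECT category, SUM(amount) AS sum FROM data
#   GROUP BY category
#   HAVING sum > 1000
```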
### 4. Select Only Needed Columns
```python
# Good: column pruning - only two columns are read
result = ds.select('name', 'amount').filter(ds['amount'] > 100)

# Less optimal: all columns are loaded
result = ds.filter(ds['amount'] > 100)
```
### 5. Leverage SQL Aggregations
```python
# GroupBy is where DataStore shines: up to 20x speedup
result = ds.groupby('category').agg({
    'amount': ['sum', 'mean', 'count', 'max'],
    'quantity': 'sum'
})
```
### 6. Use head() Instead of Full Queries
```python
# Don't load the entire result if you only need a sample
result = ds.filter(ds['type'] == 'A').head(100)  # compiles to LIMIT 100

# Avoid this for large results:
# result = ds.filter(ds['type'] == 'A').to_df()  # loads everything
```
### 7. Batch Operations
```python
# Good: chained filters compile into a single execution
result = ds.filter(ds['x'] > 10).filter(ds['y'] < 100).to_df()

# Bad: materialize early, then filter again in pandas
result1 = ds.filter(ds['x'] > 10).to_df()  # executes and loads all matching rows
result2 = result1[result1['y'] < 100]      # second pass, now in memory
```
### 8. Use explain() to Optimize
```python
# View the query plan before executing
query = ds.filter(...).groupby(...).agg(...)
query.explain()  # check whether operations are pushed down

# Then execute
result = query.to_df()
```
## Profiling Your Workload

### Enable Profiling
```python
from chdb.datastore.config import config, get_profiler

config.enable_profiling()

# Run your workload
result = your_pipeline()

# View the report
profiler = get_profiler()
profiler.report()
```
### Identify Bottlenecks
```text
Performance Report
==================
Step             Duration    % Total
----             --------    -------
SQL execution    2.5s        62.5%    <- Bottleneck!
read_csv         1.2s        30.0%
Other            0.3s         7.5%
```
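To find the slowest step programmatically, a minimal sketch; `get_steps()` and its `'duration_ms'` field are used in the comparison below, but the rest of each entry's schema is an assumption worth inspecting:

```python
from chdb.datastore.config import get_profiler

profiler = get_profiler()
steps = profiler.get_steps()

# 'duration_ms' is the field used elsewhere in this guide; print one
# entry (print(steps[0])) to see what else each step record contains.
slowest = max(steps, key=lambda s: s['duration_ms'])
print(f"Slowest step: {slowest}")
```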
### Compare Approaches
```python
# Test approach 1
profiler.reset()
result1 = approach1()
time1 = profiler.get_steps()[-1]['duration_ms']

# Test approach 2
profiler.reset()
result2 = approach2()
time2 = profiler.get_steps()[-1]['duration_ms']

print(f"Approach 1: {time1:.0f}ms")
print(f"Approach 2: {time2:.0f}ms")
```
## Best Practices Summary
| Practice | Impact |
|---|---|
| Enable performance mode | 2-8x faster for aggregation workloads |
| Use Parquet files | 3-10x faster reads |
| Filter early | Reduce data processing |
| Select needed columns | Reduce I/O and memory |
| Use GroupBy/aggregations | Up to 20x faster |
| Batch operations | Avoid repeated execution |
| Profile before optimizing | Find real bottlenecks |
| Use explain() | Verify query optimization |
| Use head() for samples | Avoid full table scans |
## Quick Decision Guide
| Your Workload | Recommendation |
|---|---|
| GroupBy/aggregation | Use DataStore |
| Complex multi-step pipeline | Use DataStore |
| Large files with filters | Use DataStore |
| Simple slice operations | Either (comparable performance) |
| Custom Python lambda functions | Use pandas or convert late |
| Very small data (<1,000 rows) | Either (negligible difference) |
**Tip**

- For automatic optimal engine selection, use `config.set_execution_engine('auto')` (the default).
- For maximum throughput on aggregation workloads, use `config.use_performance_mode()`.

See Execution Engine and Performance Mode for details.