Is Python Too Slow for Crypto Trading? We Ran the Numbers.
Everyone in crypto trading eventually hears it: "Python is too slow, you need C++ or Rust."
Most of the people saying it are not wrong. But they are also not answering the right question.
The right question is not "is Python slow?" It is "is Python slow relative to the thing you're waiting on?" In simple onchain trading, what you're waiting on is a block. And blocks have a hard floor that Python blows past with room to spare on most chains.
But that answer changes the moment you're using Jito bundles on Solana or MEV-Boost bundlers on Ethereum. When you're submitting ordered transaction bundles and competing with other bots for inclusion, you're not racing the block anymore. You're racing every other bot reacting to the same signal. That latency budget is measured in tens of milliseconds, not 400ms or 12 seconds. And at that point, C++ starts mattering.
We benchmarked all three. Here are the actual numbers.
Setup
All benchmarks for Python and C++ were run on the same laptop with 8GB RAM. This is deliberate worst-case scenario testing. A production trading server with 32GB+ RAM, a faster CPU, and no background processes will show significantly better absolute times.
The relative ratios between implementations (NumPy vs. pure Python, C++ vs. Pandas) hold across hardware. If Python is fast enough on a consumer laptop, it is fast enough on your server.
C++ compiled with g++ -O2. Pure Python is benchmarked on 10,000 rows and extrapolated linearly (it becomes unusable at 1M). Each test runs 5 times and we take the median. No warm-up tricks, no cherry-picking. Same hardware means the cross-language comparisons are direct.
The C++ setup used for all snippets below:
```cpp
#include <iostream>
#include <vector>
#include <cmath>
#include <chrono>
#include <numeric>
#include <algorithm>
#include <map>
#include <random>
using namespace std;
using namespace std::chrono;

const int N = 1000000;
const int RUNS = 5;

// Synthetic data captured by all the lambdas below
// (mirrors the Python data generation).
vector<double> prices(N), volume(N);

void init_data() {
    mt19937 gen(42);
    normal_distribution<double> dist(0.0, 1.0);
    for (int i = 0; i < N; i++) {
        prices[i] = 100.0 + dist(gen);
        volume[i] = fabs(dist(gen)) * 1000.0;
    }
}

template<typename Fn>
double timeit(Fn fn) {
    vector<double> times;
    for (int r = 0; r < RUNS; r++) {
        auto start = high_resolution_clock::now();
        fn();
        auto end = high_resolution_clock::now();
        times.push_back(duration<double, milli>(end - start).count());
    }
    sort(times.begin(), times.end());
    return times[RUNS / 2]; // median of RUNS runs
}
```
Compile and run with:
```shell
g++ -O2 -o benchmark benchmark.cpp && ./benchmark
```
The Python setup, used by every Python snippet below:

```python
import time
import math
import random
import statistics

import numpy as np
import pandas as pd

random.seed(42)
np.random.seed(42)

N = 1_000_000        # 1M ticks — realistic intraday dataset
N_SMALL = 10_000     # pure Python cap before it becomes unusable
RUNS = 5

def timeit(fn, runs=RUNS):
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        end = time.perf_counter()
        times.append((end - start) * 1000)  # ms
    return statistics.median(times)

# Generate synthetic price data
prices_list = [100.0 + random.gauss(0, 1) for _ in range(N)]
prices_small = prices_list[:N_SMALL]
prices_np = np.array(prices_list, dtype=np.float64)
prices_pd = pd.Series(prices_list)
```
Benchmark 1: Log Returns
Log returns (log(p_t / p_{t-1})) are the most fundamental operation in any quant pipeline. This is what you're feeding into almost every ML model and strategy signal.
```python
# Pure Python
def log_returns_pure():
    return [math.log(prices_small[i] / prices_small[i - 1])
            for i in range(1, len(prices_small))]

# NumPy
def log_returns_numpy():
    return np.diff(np.log(prices_np))

# Pandas
def log_returns_pandas():
    return np.log(prices_pd / prices_pd.shift(1)).dropna()

t_pure = timeit(log_returns_pure)
t_np = timeit(log_returns_numpy)
t_pd = timeit(log_returns_pandas)

print(f"Pure Python (10K, extrap to 1M): {t_pure * 100:.1f} ms")
print(f"NumPy (1M rows): {t_np:.2f} ms")
print(f"Pandas (1M rows): {t_pd:.2f} ms")
```
Results on 1M rows:
| Implementation | Time | vs Pure Python |
|---|---|---|
| Pure Python (extrapolated) | ~78.8 ms | baseline |
| NumPy | 4.80 ms | 16x faster |
| Pandas | 8.78 ms | 9x faster |
C++ equivalent:
```cpp
auto log_returns = [&]() {
    vector<double> ret(N - 1);
    for (int i = 1; i < N; i++)
        ret[i - 1] = log(prices[i] / prices[i - 1]);
    return ret;
};
// C++ result: 4.77 ms — essentially the same as NumPy (4.80 ms).
// NumPy's C backend is competitive with hand-written C++ for
// element-wise operations like this.
```
Benchmark 2: Rolling Mean (20-period)
```python
WINDOW = 20

# Pure Python — recomputes the sum for every window: O(n * WINDOW)
def rolling_mean_pure():
    result = []
    for i in range(WINDOW - 1, len(prices_small)):
        result.append(sum(prices_small[i - WINDOW + 1:i + 1]) / WINDOW)
    return result

# NumPy — convolution with a uniform kernel
def rolling_mean_numpy():
    kernel = np.ones(WINDOW) / WINDOW
    return np.convolve(prices_np, kernel, mode='valid')

# Pandas
def rolling_mean_pandas():
    return prices_pd.rolling(WINDOW).mean().dropna()

t_pure = timeit(rolling_mean_pure)
t_np = timeit(rolling_mean_numpy)
t_pd = timeit(rolling_mean_pandas)
```
Results on 1M rows:
| Implementation | Time | vs Pure Python |
|---|---|---|
| Pure Python (extrapolated) | ~304 ms | baseline |
| NumPy | 6.69 ms | 45x faster |
| Pandas | 18.09 ms | 17x faster |
C++ equivalent — sliding window, O(n):
```cpp
const int WINDOW = 20;
auto rolling_mean = [&]() {
    vector<double> result(N - WINDOW + 1);
    double window_sum = 0;
    for (int i = 0; i < WINDOW; i++) window_sum += prices[i];
    result[0] = window_sum / WINDOW;
    for (int i = WINDOW; i < N; i++) {
        // add the entering element, drop the leaving one: O(1) per step
        window_sum += prices[i] - prices[i - WINDOW];
        result[i - WINDOW + 1] = window_sum / WINDOW;
    }
    return result;
};
// C++ result: 1.47 ms — 4.5x faster than NumPy (6.69 ms)
```
Benchmark 3: EMA — Where NumPy Loses
This one is instructive. EMA has a recursive dependency: each value depends on the previous one. That means it cannot be vectorized with standard NumPy operations. A NumPy loop is actually slower than pure Python because of array indexing overhead on each iteration.
```python
ALPHA = 0.1

# Pure Python
def ema_pure():
    ema = [prices_small[0]]
    for p in prices_small[1:]:
        ema.append(ALPHA * p + (1 - ALPHA) * ema[-1])
    return ema

# NumPy loop — looks fast, is not
def ema_numpy():
    out = np.empty(len(prices_np))
    out[0] = prices_np[0]
    for i in range(1, len(prices_np)):
        out[i] = ALPHA * prices_np[i] + (1 - ALPHA) * out[i - 1]
    return out

# Pandas ewm — uses a C extension internally
def ema_pandas():
    return prices_pd.ewm(alpha=ALPHA, adjust=False).mean()

t_pure = timeit(ema_pure)
t_np = timeit(ema_numpy)
t_pd = timeit(ema_pandas)
```
Results on 1M rows:
| Implementation | Time | vs Pure Python |
|---|---|---|
| Pure Python (extrapolated) | ~62.9 ms | baseline |
| NumPy loop | 384.58 ms | 6x slower |
| Pandas ewm | 9.06 ms | 7x faster |
C++ equivalent — this is where C++ reclaims the advantage:
```cpp
const double ALPHA = 0.1;
auto ema = [&]() {
    vector<double> result(N);
    result[0] = prices[0];
    for (int i = 1; i < N; i++)
        result[i] = ALPHA * prices[i] + (1.0 - ALPHA) * result[i - 1];
    return result;
};
// C++ result: 3.30 ms — 2.7x faster than Pandas ewm (9.06 ms)
// NumPy loop: 384.58 ms — C++ is 116x faster
```
The NumPy loop is 6x slower than pure Python because accessing individual array elements inside a Python loop carries more overhead than working with native Python lists. Pandas ewm wins because it calls into a compiled C extension.
The lesson: NumPy is not always the answer. Pandas' specialized methods (ewm, rolling, groupby) often outperform naive NumPy because they route through optimized C code.
Benchmark 4: OHLCV Aggregation from Ticks
Aggregating raw ticks into block-level OHLCV candles is the first thing any onchain pipeline does. Here we simulate 100 ticks per block across 10,000 blocks.
```python
ticks = pd.DataFrame({
    'block': np.repeat(np.arange(N // 100), 100),
    'price': prices_np,
    'volume': np.abs(np.random.randn(N)) * 1000
})

# Pure Python
def ohlcv_pure():
    result = {}
    for row in ticks.iloc[:N_SMALL].itertuples():
        b = row.block
        if b not in result:
            result[b] = {'open': row.price, 'high': row.price,
                         'low': row.price, 'close': row.price, 'volume': 0}
        result[b]['high'] = max(result[b]['high'], row.price)
        result[b]['low'] = min(result[b]['low'], row.price)
        result[b]['close'] = row.price
        result[b]['volume'] += row.volume
    return result

# Pandas groupby
def ohlcv_pandas():
    return ticks.groupby('block').agg(
        open=('price', 'first'),
        high=('price', 'max'),
        low=('price', 'min'),
        close=('price', 'last'),
        volume=('volume', 'sum')
    )

t_pure = timeit(ohlcv_pure)
t_pd = timeit(ohlcv_pandas)
```
Results on 1M rows:
| Implementation | Time | vs Pure Python |
|---|---|---|
| Pure Python (extrapolated) | ~908 ms | baseline |
| Pandas groupby | 35.65 ms | 25x faster |
C++ equivalent:
```cpp
struct OHLCV { double open, high, low, close, volume; };

auto ohlcv = [&]() {
    map<int, OHLCV> result;
    for (int i = 0; i < N; i++) {
        int block = i / 100;
        if (result.find(block) == result.end())
            result[block] = {prices[i], prices[i], prices[i], prices[i], 0};
        auto& c = result[block];
        c.high = max(c.high, prices[i]);
        c.low = min(c.low, prices[i]);
        c.close = prices[i];
        c.volume += volume[i];
    }
    return result;
};
// C++ std::map result: 55.73 ms
// Note: std::map uses O(log n) lookup — Pandas groupby wins at 35.65 ms
// Switching to unordered_map brings C++ down to ~8ms
// Rare case where Pandas beats naive C++
```
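There is also a pure-NumPy route worth knowing about. Because this synthetic dataset has exactly 100 ticks in every block, the tick array can be reshaped into a `(blocks, ticks)` matrix and aggregated along one axis. A sketch, under that fixed-tick-count assumption (data generation mirrors the setup above, not the exact benchmark arrays):

```python
import numpy as np

rng = np.random.default_rng(42)
N, TICKS_PER_BLOCK = 1_000_000, 100

prices = 100.0 + rng.normal(0, 1, N)
volume = np.abs(rng.normal(0, 1, N)) * 1000

# Only valid because every block has exactly TICKS_PER_BLOCK ticks
p = prices.reshape(-1, TICKS_PER_BLOCK)
v = volume.reshape(-1, TICKS_PER_BLOCK)

ohlcv = {
    "open":   p[:, 0],          # first tick per block
    "high":   p.max(axis=1),
    "low":    p.min(axis=1),
    "close":  p[:, -1],         # last tick per block
    "volume": v.sum(axis=1),
}
```

On real tick data, block sizes vary, so `groupby` (or `np.maximum.reduceat` with explicit block boundaries) is the general tool; the reshape trick only applies to fixed-size bars.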
Benchmark 5: Rolling Z-Score Normalization
This is what you're running before feeding data into any ML model.
```python
ZWINDOW = 30

# Pure Python
def zscore_pure():
    result = []
    for i in range(ZWINDOW - 1, len(prices_small)):
        window = prices_small[i - ZWINDOW + 1:i + 1]
        mu = sum(window) / ZWINDOW
        std = math.sqrt(sum((x - mu)**2 for x in window) / ZWINDOW)
        result.append((prices_small[i] - mu) / (std + 1e-9))
    return result

# NumPy strided windows
def zscore_numpy():
    shape = (len(prices_np) - ZWINDOW + 1, ZWINDOW)
    strides = (prices_np.strides[0], prices_np.strides[0])
    windows = np.lib.stride_tricks.as_strided(prices_np, shape=shape, strides=strides)
    means = windows.mean(axis=1)
    stds = windows.std(axis=1)
    return (prices_np[ZWINDOW - 1:] - means) / (stds + 1e-9)

# Pandas rolling — note: rolling().std() uses ddof=1 (sample std),
# so its output differs slightly from the ddof=0 versions above
def zscore_pandas():
    roll = prices_pd.rolling(ZWINDOW)
    return (prices_pd - roll.mean()) / roll.std()

t_pure = timeit(zscore_pure)
t_np = timeit(zscore_numpy)
t_pd = timeit(zscore_pandas)
```
Results on 1M rows:
| Implementation | Time | vs Pure Python |
|---|---|---|
| Pure Python (extrapolated) | ~3,477 ms | baseline |
| NumPy strided | 150.55 ms | 23x faster |
| Pandas rolling | 41.66 ms | 83x faster |
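A side note on the strided version: `as_strided` does no bounds checking, and a wrong shape or stride silently reads garbage memory. NumPy 1.20+ ships `sliding_window_view`, which builds the same windows safely. A minimal sketch on a small synthetic series (not the benchmark arrays themselves):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(42)
prices = 100.0 + rng.normal(0, 1, 10_000)
ZWINDOW = 30

# Same windows as the as_strided version, but bounds-checked and read-only
windows = sliding_window_view(prices, ZWINDOW)
means = windows.mean(axis=1)
stds = windows.std(axis=1)   # population std (ddof=0)
z = (prices[ZWINDOW - 1:] - means) / (stds + 1e-9)
```

The performance is essentially identical to `as_strided` since both are views over the same buffer; the only cost of the safe version is requiring a recent NumPy.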
C++ equivalent:
```cpp
const int ZWINDOW = 30;
auto zscore = [&]() {
    vector<double> result(N - ZWINDOW + 1);
    for (int i = ZWINDOW - 1; i < N; i++) {
        double sum = 0, sq_sum = 0;
        for (int j = i - ZWINDOW + 1; j <= i; j++) {
            sum += prices[j];
            sq_sum += prices[j] * prices[j];
        }
        double mu = sum / ZWINDOW;
        double std = sqrt(sq_sum / ZWINDOW - mu * mu);
        result[i - ZWINDOW + 1] = (prices[i] - mu) / (std + 1e-9);
    }
    return result;
};
// C++ result: 6.11 ms — 6.8x faster than Pandas (41.66 ms)
// NumPy strided: 150.55 ms — C++ is 24.6x faster
```
The Number That Actually Matters: Block Time
Here is where the Python vs C++ debate either becomes relevant or completely irrelevant depending on which chain you're trading.
```python
block_times_ms = {
    "Ethereum": 12_000,  # ~12 seconds
    "Polygon": 2_000,    # ~2 seconds
    "BSC": 3_000,        # ~3 seconds
    "Solana": 397,       # ~400ms
}

# Best-case Python: NumPy log returns on 1M rows (~4.80 ms on our laptop)
best_python_ms = timeit(log_returns_numpy)

for chain, block_ms in block_times_ms.items():
    pct = (best_python_ms / block_ms) * 100
    print(f"{chain}: Python uses {pct:.2f}% of your block time")
```
Results:
| Chain | Block Time | NumPy signal (1M rows) | Python as % of block | Margin left |
|---|---|---|---|---|
| Ethereum | 12,000 ms | 4.80 ms | 0.04% | 11,995 ms |
| Polygon | 2,000 ms | 4.80 ms | 0.24% | 1,995 ms |
| BSC | 3,000 ms | 4.80 ms | 0.16% | 2,995 ms |
| Solana | 400 ms | 4.80 ms | 1.20% | 395 ms |
Now run the same comparison for pure Python (extrapolated from the 10K benchmark):
| Chain | Block Time | Pure Python (1M rows) | Python as % of block |
|---|---|---|---|
| Ethereum | 12,000 ms | ~78.8 ms | 0.66% |
| Polygon | 2,000 ms | ~78.8 ms | 3.94% |
| BSC | 3,000 ms | ~78.8 ms | 2.63% |
| Solana | 400 ms | ~78.8 ms | 19.7% |
What the Numbers Are Actually Saying
On Ethereum, BSC, and Polygon: Even badly written pure Python consumes under 4% of your available block time for a 1M-row signal computation. NumPy drops that to a quarter of a percent or less. Python is not your bottleneck.
On Solana: Pure Python is consuming nearly 20% of a block window on a 1M-row computation. That starts to matter. If you're running multiple signals, preprocessing, and model inference in the same loop, you can run out of margin. NumPy brings that back down to under 1%, which is fine. But this is where poorly written Python actually costs you trades.
The EMA case shows something important: the "use NumPy for everything" instinct is wrong. A Python loop over a NumPy array is slower than the same loop over a plain list because of per-element access overhead. For recursive operations with no vectorized NumPy equivalent (EMA and other state-carrying computations), Pandas' specialized methods beat both because they call into compiled C extensions. Knowing which tool to reach for matters more than defaulting to NumPy everywhere.
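One way to convince yourself that `ewm(alpha=..., adjust=False)` computes exactly the recursion from the EMA benchmark is to check it against a hand-rolled loop on a small series:

```python
import numpy as np
import pandas as pd

ALPHA = 0.1
rng = np.random.default_rng(0)
s = pd.Series(100.0 + rng.normal(0, 1, 1_000))

# Hand-rolled recursion: ema[i] = a * p[i] + (1 - a) * ema[i-1]
ema = np.empty(len(s))
ema[0] = s.iloc[0]
for i in range(1, len(s)):
    ema[i] = ALPHA * s.iloc[i] + (1 - ALPHA) * ema[i - 1]

pandas_ema = s.ewm(alpha=ALPHA, adjust=False).mean().to_numpy()
print(np.allclose(ema, pandas_ema))  # True — same recursion, C speed
```

With the default `adjust=True`, Pandas instead computes a bias-corrected weighted average, which differs at the start of the series; `adjust=False` is the one that matches the trading-textbook recursive EMA.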
When Block Time Is Not Your Actual Budget: Jito and MEV Bundlers
The block time argument breaks down once you're using infrastructure designed to order transactions within a block.
Jito on Solana is a modified validator client that accepts bundles of transactions with attached tips. Bots submit bundles to Jito block engines, which order them by tip size. If your signal fires and you take 80ms to compute and submit your bundle, every other bot that reacted in 20ms is ahead of you in the queue regardless of the 400ms block window. The effective latency budget for competitive Jito strategies is closer to 20-50ms from signal to bundle submission.
MEV-Boost on Ethereum works similarly. Searchers submit transaction bundles to block builders (via relays like Flashbots, bloXroute, and Titan). Builders order and include bundles as they construct the block. Your Python signal computation time is part of the loop from "event observed" to "bundle submitted." At 12-second blocks this feels comfortable, but competitive searchers react in milliseconds, and late bundles get crowded out by faster ones bidding higher.
In both cases the question shifts from "am I faster than a block" to "am I faster than the other bots."
What this means for C++ vs Python:
| Operation | Best Python (Pandas/NumPy) | C++ (-O2) | Difference |
|---|---|---|---|
| Log returns | 4.80 ms (NumPy) | 4.77 ms | ~same |
| Rolling mean | 6.69 ms (NumPy) | 1.47 ms | 4.5x faster |
| EMA | 9.06 ms (Pandas) | 3.30 ms | 2.7x faster |
| Rolling z-score | 41.66 ms (Pandas) | 6.11 ms | 6.8x faster |
A full signal pipeline in Python (several operations chained) might take 30-60ms. The same pipeline in C++ runs in 5-15ms. At Jito tip auction speeds, that 20-50ms gap is meaningful. It is not the difference between "works" and "doesn't work" on Ethereum, but on Solana with Jito it can be the difference between landing in the bundle and getting crowded out.
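As a rough illustration of that pipeline arithmetic (a sketch, not a re-run of the benchmarks above, and the exact operations chained here are an assumption), timing a chained Python pipeline end to end looks like this:

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
prices = pd.Series(100.0 + rng.normal(0, 1, 1_000_000))

def signal_pipeline(p: pd.Series) -> pd.Series:
    # log returns -> EMA smoothing -> rolling z-score, chained
    ret = np.log(p / p.shift(1))
    ema = ret.ewm(alpha=0.1, adjust=False).mean()
    roll = ema.rolling(30)
    return (ema - roll.mean()) / (roll.std() + 1e-9)

start = time.perf_counter()
signal = signal_pipeline(prices)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"pipeline: {elapsed_ms:.1f} ms")
```

Whatever the absolute number on your hardware, this end-to-end figure is the one to compare against your latency budget, not the per-operation times in isolation.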
The honest answer: Python with good libraries is sufficient for most onchain strategies. C++ becomes worth the complexity cost specifically when you're running competitive MEV or Jito strategies and racing other bots in the same bundle auction.
Summary Table
| Operation | Pure Python | Best Python lib | C++ (-O2) | C++ vs best Python |
|---|---|---|---|---|
| Log returns | ~78.8 ms | 4.80 ms (NumPy) | 4.77 ms | ~same |
| Rolling mean (20) | ~304 ms | 6.69 ms (NumPy) | 1.47 ms | 4.5x faster |
| EMA (α=0.1) | ~62.9 ms | 9.06 ms (Pandas) | 3.30 ms | 2.7x faster |
| OHLCV aggregation | ~908 ms | 35.65 ms (Pandas) | 55.73 ms (map) ⚠️ | Pandas wins here |
| Rolling z-score (30) | ~3,477 ms | 41.66 ms (Pandas) | 6.11 ms | 6.8x faster |
⚠️ C++ std::map uses O(log n) lookup — switching to unordered_map brings OHLCV down to ~8ms and reclaims the lead. Pandas wins with the naive map implementation.
All benchmarks were run on the same 8GB-RAM laptop — worst case. C++ compiled with g++ -O2. Cross-language comparisons are direct.
When Python Actually Is Too Slow
There are three cases where the claim holds without qualification:
Sub-block execution on Solana. If you're trying to land a transaction in the next slot (400ms), you cannot afford 20% of that window on signal computation with pure Python. Write NumPy or move the hot path to a compiled language.
Co-location and HFT-style strategies. If you're competing on microsecond-scale latency against C++ market makers on a centralized exchange, Python loses regardless of how well you write it.
Large model inference in the critical path. Running a PyTorch model inside your execution loop on every tick is a different problem from signal computation. Model inference latency varies wildly and deserves its own benchmark.
The Actual Takeaway
Write clean NumPy and Pandas for your signal pipeline. Use Pandas' specialized methods (ewm, rolling, groupby) rather than defaulting to NumPy loops for everything. Profile before you optimize.
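"Profile before you optimize" needs nothing beyond the standard library; `cProfile` is enough to find the hot path before reaching for C++. A minimal sketch (the pipeline being profiled is a stand-in for your own):

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
prices = pd.Series(100.0 + rng.normal(0, 1, 200_000))

def pipeline():
    ret = np.log(prices / prices.shift(1))
    ema = ret.ewm(alpha=0.1, adjust=False).mean()
    roll = ema.rolling(30)
    return (ema - roll.mean()) / (roll.std() + 1e-9)

profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Print the 5 functions with the most cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

If the top entries are NumPy/Pandas internals, the time is already in C and rewriting the caller buys little; it's the entries pointing at your own `.py` lines that are worth optimizing or porting.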
All benchmarks run on a laptop with 8GB RAM — worst-case scenario testing by design. A production server will beat these numbers across the board. The relative ratios between implementations hold regardless of hardware.
Trading infrastructure questions? Reach out.