A machine learning system for predicting short-term mid-price moves from order book features and backtesting trading signals.
Report: ML_Microstructure_Signals Report.pdf
Note: This is a student project - some features are experimental and results may vary.
# Clone the repository
git clone https://github.com/IsmailMoudden/ML-Microstructure-Signals.git
cd ML-Microstructure-Signals
# Install in development mode
pip install -e ".[dev]"
# Or install production dependencies only
pip install -e .
# Train a model with synthetic data
python -m ml_microstructure.pipeline.train config=configs/model/lgbm.yaml
# Make predictions
python -m ml_microstructure.pipeline.predict run_id=<mlflow_run_id>
# Evaluate model performance
python -m ml_microstructure.pipeline.evaluate run_id=<mlflow_run_id>
# Run backtest
python -m ml_microstructure.backtest.run run_id=<mlflow_run_id>
# Launch dashboard
streamlit run ml_microstructure/dashboards/streamlit_app.py
- Synthetic LOB: Poisson arrival generator for testing
- LOBSTER: High-frequency order book data (config available)
- Kaggle Crypto: Cryptocurrency order book data (config available)
- Order Flow Imbalance (OFI): Multi-level order flow analysis
- Spread Features: Bid-ask spread dynamics
- Depth Features: Order book depth analysis
- Imbalance Features: Queue imbalance metrics
- VWAP Features: Volume-weighted average price
- Rolling Returns: Multi-horizon return features
- Microprice: Weighted mid-price calculation
- Baseline Models: Logistic Regression, Random Forest, LightGBM
- Sequence Models: LSTM, Transformer (available in code)
- Hyperparameter Optimization: Optuna integration
- Model Persistence: MLflow tracking
- Signal Generation: Probability-to-signal mapping
- Execution Engine: Transaction costs, slippage, position sizing
- Performance Metrics: Sharpe, Sortino, Calmar ratios, drawdown analysis
- Walk-Forward Analysis: Out-of-sample testing
- Live Replay: Real-time feature visualization
- Model Analysis: Feature importance, prediction confidence
- Backtest Results: Equity curves, drawdown analysis
- Interactive Charts: Plotly-based visualizations
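Several of the order book features listed above (microprice, queue imbalance, order flow imbalance) can be computed from the best bid and ask alone. The following is a minimal sketch under stated assumptions, not the project's `FeaturePipeline`: the `TopOfBook` class and all function names are hypothetical, and the OFI formula is a commonly used level-1 definition that may differ from the multi-level version implemented in the repository.

```python
from dataclasses import dataclass


@dataclass
class TopOfBook:
    """Best bid/ask snapshot. Field names are illustrative, not the project's schema."""
    bid_price: float
    bid_size: float
    ask_price: float
    ask_size: float


def mid_price(s: TopOfBook) -> float:
    # Simple average of best bid and best ask.
    return (s.bid_price + s.ask_price) / 2.0


def microprice(s: TopOfBook) -> float:
    # Size-weighted mid: a larger bid queue pulls the estimate toward the ask.
    total = s.bid_size + s.ask_size
    return (s.ask_price * s.bid_size + s.bid_price * s.ask_size) / total


def queue_imbalance(s: TopOfBook) -> float:
    # Ranges from -1 (all size on the ask) to +1 (all size on the bid).
    return (s.bid_size - s.ask_size) / (s.bid_size + s.ask_size)


def order_flow_imbalance(prev: TopOfBook, curr: TopOfBook) -> float:
    # Level-1 OFI between two consecutive snapshots (assumed definition).
    bid_flow = 0.0
    if curr.bid_price >= prev.bid_price:
        bid_flow += curr.bid_size   # bid held or improved: count current size
    if curr.bid_price <= prev.bid_price:
        bid_flow -= prev.bid_size   # bid held or retreated: remove previous size
    ask_flow = 0.0
    if curr.ask_price <= prev.ask_price:
        ask_flow += curr.ask_size   # ask held or improved: count current size
    if curr.ask_price >= prev.ask_price:
        ask_flow -= prev.ask_size   # ask held or retreated: remove previous size
    return bid_flow - ask_flow


prev = TopOfBook(bid_price=100.0, bid_size=100.0, ask_price=101.0, ask_size=300.0)
curr = TopOfBook(bid_price=100.0, bid_size=150.0, ask_price=101.0, ask_size=300.0)
print(mid_price(curr), microprice(curr), queue_imbalance(curr))
print(order_flow_imbalance(prev, curr))
```

In this example the ask queue is larger than the bid queue, so the microprice sits below the mid-price and the queue imbalance is negative, while the added bid size between the two snapshots gives a positive OFI.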
ml_microstructure/
├── data/ # Data loaders and processors
├── features/ # Feature extraction pipeline
├── models/ # ML model implementations
├── pipeline/ # Training, prediction, evaluation
├── backtest/ # Signal generation and backtesting
├── dashboards/ # Streamlit dashboard
└── utils/ # Utilities and helpers
configs/ # Hydra configuration files
tests/ # Unit and integration tests
notebooks/ # Jupyter notebooks for EDA
reports/ # LaTeX research reports
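The synthetic data source is described as a Poisson arrival generator. As a rough illustration of that idea only, the sketch below draws event times for a homogeneous Poisson process by summing exponential inter-arrival gaps. It assumes `arrival_rate` means events per second; `poisson_arrival_times` is a hypothetical helper and not part of `SyntheticLOBGenerator`, whose internals may differ.

```python
import random


def poisson_arrival_times(
    arrival_rate: float, duration_seconds: float, seed: int = 0
) -> list[float]:
    """Event times of a homogeneous Poisson process on [0, duration_seconds).

    Inter-arrival gaps are exponential with mean 1 / arrival_rate.
    """
    rng = random.Random(seed)  # seeded for reproducible test data
    times = []
    t = rng.expovariate(arrival_rate)
    while t < duration_seconds:
        times.append(t)
        t += rng.expovariate(arrival_rate)
    return times


times = poisson_arrival_times(arrival_rate=100.0, duration_seconds=60.0)
# The count is random but concentrates around arrival_rate * duration_seconds.
print(len(times))
```

Each arrival time would then be assigned an order book event (limit order, cancellation, or trade) by whatever rule the generator uses; that step is left out here because the source does not describe it.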
from ml_microstructure.data import SyntheticLOBGenerator, OrderBookProcessor
from ml_microstructure.features import FeaturePipeline
# Generate synthetic data
generator = SyntheticLOBGenerator(
    initial_price=100.0,
    tick_size=0.01,
    max_levels=10,
    arrival_rate=100.0,
    duration_seconds=3600,
)
snapshots = generator.generate_data()
# Process into DataFrame
processor = OrderBookProcessor(max_levels=10)
df = processor.process_snapshots(snapshots)
# Extract features
pipeline = FeaturePipeline()
df_features = pipeline.extract_features(df)
from ml_microstructure.models import ModelFactory, ModelConfig
from ml_microstructure.utils.labeling import LabelGenerator
# Generate labels
label_generator = LabelGenerator(horizon=1, threshold=0.001)
labels = label_generator.generate_labels(df_features)
# Prepare training data
X = df_features.drop(['timestamp'], axis=1)
y = labels
# Train model
config = ModelConfig(model_type="lightgbm")
model = ModelFactory.create_model(config)
model.fit(X, y)
from ml_microstructure.backtest import BacktestRunner
# Run backtest
runner = BacktestRunner(config)
results = runner.run(run_id="your_mlflow_run_id")
# View results
print(results["report"])
The system uses Hydra for configuration management. Key configuration files:
- configs/train.yaml: Training pipeline configuration
- configs/predict.yaml: Prediction pipeline configuration
- configs/backtest.yaml: Backtesting configuration
- configs/model/*.yaml: Model-specific parameters
# configs/train.yaml
data:
  type: synthetic
  synthetic:
    initial_price: 100.0
    tick_size: 0.01
    max_levels: 10
    arrival_rate: 100.0
    duration_seconds: 3600
model:
  type: lightgbm
  params:
    n_estimators: 100
    max_depth: 6
    learning_rate: 0.1
labeling:
  horizon: 1
  threshold: 0.001
  method: ternary
The backtesting system calculates comprehensive performance metrics:
- Return Metrics: Annualized return, Sharpe ratio, Sortino ratio
- Risk Metrics: Maximum drawdown, Value at Risk (VaR)
- Trade Metrics: Hit rate, profit factor, turnover
- Risk-Adjusted Metrics: Calmar ratio, information ratio
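To make a few of the metrics above concrete, here is a minimal sketch of Sharpe, Sortino, maximum drawdown, and Calmar computed from a list of per-period simple returns. It assumes a zero risk-free rate and a zero target return for the downside deviation; the function names are hypothetical and the project's backtest module may use different conventions (for example, a different annualization factor or drawdown definition).

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def sharpe_ratio(returns, periods_per_year):
    # Annualized mean return over sample standard deviation (risk-free rate assumed 0).
    mean = _mean(returns)
    var = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    if var == 0.0:
        return math.nan
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def sortino_ratio(returns, periods_per_year):
    # Like Sharpe, but only returns below 0 contribute to the risk term.
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in returns) / len(returns))
    if downside == 0.0:
        return math.nan
    return _mean(returns) / downside * math.sqrt(periods_per_year)


def max_drawdown(returns):
    # Largest peak-to-trough loss of the compounded equity curve, as a fraction.
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def calmar_ratio(returns, periods_per_year):
    # Annualized compounded return divided by maximum drawdown.
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    years = len(returns) / periods_per_year
    annualized = total ** (1.0 / years) - 1.0
    mdd = max_drawdown(returns)
    return annualized / mdd if mdd > 0.0 else math.nan


daily = [0.01, -0.02, 0.015, 0.005, -0.01]
print(sharpe_ratio(daily, 252), sortino_ratio(daily, 252))
print(max_drawdown(daily), calmar_ratio(daily, 252))
```

The functions return NaN instead of raising when the denominator is zero (constant returns, no losing periods, or no drawdown), which keeps report generation from failing on degenerate backtests.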
# Run all tests
pytest
# Run with coverage
pytest --cov=ml_microstructure --cov-report=html
# Run specific test categories
pytest -m "not slow" # Skip slow tests
pytest -m integration # Integration tests only
- Report: ML_Microstructure_Signals_Report.pdf - Complete report with results and analysis
- Report Sources: LaTeX source files in reports/paper/sources/
- API Documentation: Available in docs/ (generated with Sphinx)
- Jupyter Notebooks: Examples in notebooks/
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
# Install pre-commit hooks
pre-commit install
# Run linting
ruff check .
black --check .
# Run type checking
mypy ml_microstructure/
- Synthetic data only: Models tested mainly on generated data, no validation on real high-frequency data
- Simplified costs: Transaction costs and slippage modeled in a basic way, no signal→order latency
- No baseline: Missing comparison with Buy&Hold or simple strategies (SMA)
- Basic ML metrics: Focus on financial metrics, missing AUC-PR, calibration, Brier score
- PyTorch complexity: LSTM/Transformer implementation difficult, memory issues with large datasets
- Hydra configuration: Initially confusing setup, complex hierarchy to master
- Unit tests: Tedious to write, difficult edge cases to cover
- Realistic backtest: Balancing realism and simplicity is complicated
- No walk-forward analysis enabled by default
- Basic missing data handling
- No operational risk management
- Manual hyperparameter optimization
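Since walk-forward analysis is listed as not enabled by default, the sketch below shows one simple way such splits can be generated for time-ordered data: an expanding training window followed by a fixed-size test block, with no shuffling. This is an illustration under assumptions, not the project's implementation; `walk_forward_splits` and its parameters are hypothetical names.

```python
def walk_forward_splits(n_samples: int, n_folds: int, min_train: int):
    """Yield (train_indices, test_indices) pairs with an expanding training window.

    Every test block lies strictly after its training window, so the model is
    never evaluated on data older than what it was trained on.
    """
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size
        test_end = train_end + test_size
        yield range(0, train_end), range(train_end, test_end)


for train_idx, test_idx in walk_forward_splits(n_samples=100, n_folds=4, min_train=20):
    print(f"train [0, {train_idx.stop}) -> test [{test_idx.start}, {test_idx.stop})")
```

With overlapping label horizons, a gap between the end of each training window and the start of its test block may be needed to avoid leakage; that refinement is omitted here.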
This project is licensed under the MIT License - see the LICENSE file for details.
- LOBSTER: High-frequency order book data
- Kaggle: Cryptocurrency datasets
- MLflow: Experiment tracking
- Hydra: Configuration management
- Streamlit: Dashboard framework
For questions and support:
- Email: ismail.moudden1@gmail.com
- Issues: GitHub Issues
- Documentation: Wiki
- Real-time data streaming integration
- Advanced sequence models (Transformer variants)
- Multi-asset backtesting
- Risk management modules
- Cloud deployment templates
- Additional data sources (Binance, Coinbase)
Disclaimer: This software is for educational and research purposes only. It is not intended for live trading without proper risk management and testing.