
BankSight: A Research-Driven Approach to Financial Forecasting

Welcome to BankSight, a deep-dive project into forecasting the BankNifty index. This repository documents a complete workflow, from initial data research and model experimentation to building a reproducible MLOps pipeline and deploying the final model via an API.

The project's primary focus is on the research and development process, demonstrating how to systematically approach a time-series forecasting problem.

The inference API developed from this project now powers a live application. You can interact with the model's predictions at https://finsight-phuysyk4na-el.a.run.app/.

1. The Research Journey: Finding a Signal in the Noise

The core of this project lies in the experimental notebooks found in the research/ directory. The goal was not just to build a model, but to find a robust and verifiable one through careful data handling and extensive hyperparameter tuning.

Data Preparation & Integrity (research/00_Data_Split_and_preprocessing.ipynb)

The foundation of any good model is clean, reliable data. The first phase of research focused on this:

  • Data Aggregation: Historical daily data for the BankNifty index (^NSEBANK), the broader NIFTY 50 index (^NSEI), and its constituent banks were downloaded using the yfinance library.
  • Temporal Alignment: A critical step was ensuring all time-series data were perfectly aligned. The notebook implements logic to identify and drop dates where data was missing for the primary index, ensuring the integrity of sequences.
  • Leakage Prevention: To prevent the model from accidentally "seeing" future data during training, a strict 70/15/15 split (Train/Validation/Test) was performed. Furthermore, a 200-day buffer was added to the validation and test sets. This ensures that features engineered from rolling windows do not leak information across splits. The notebook output confirms these buffered shapes, for example: bn_Val+buffer: (494, 5).
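
The buffered chronological split described above can be sketched as follows. This is a minimal illustration with hypothetical variable names and toy data, not the notebook's exact code:

```python
import numpy as np
import pandas as pd

def buffered_split(df: pd.DataFrame, train=0.70, val=0.15, buffer=200):
    """Chronological 70/15/15 split; prepend a `buffer`-day overlap to the
    validation and test sets so rolling-window features can warm up without
    reaching across split boundaries."""
    n = len(df)
    i_train = int(n * train)
    i_val = int(n * (train + val))
    train_df = df.iloc[:i_train]
    # Buffer rows are used only to compute rolling indicators, then dropped.
    val_df = df.iloc[max(i_train - buffer, 0):i_val]
    test_df = df.iloc[max(i_val - buffer, 0):]
    return train_df, val_df, test_df

# Toy example: 2000 daily OHLCV rows
idx = pd.date_range("2016-01-01", periods=2000, freq="B")
df = pd.DataFrame(np.random.rand(2000, 5),
                  columns=["Open", "High", "Low", "Close", "Volume"], index=idx)
tr, va, te = buffered_split(df)
print(tr.shape, va.shape, te.shape)  # validation/test shapes include the 200-row buffer
```

The buffer rows overlap the end of the preceding split on purpose: they exist only so indicators like 50-day moving averages are defined from the first "real" row of each split.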

Feature Engineering (research/01_feature_eng.ipynb)

This notebook expands the raw OHLCV data into a rich feature set to provide the model with more context.

  • Technical Indicators: A comprehensive suite of indicators was generated using the talib library, including RSI, MACD, Bollinger Bands, and ATR.
  • Market Context: To capture the broader market dynamics, features like the return spread between BankNifty and the NIFTY 50 were created.
  • Breadth Indicators: The model's view was further enriched by analyzing the constituent bank stocks. The notebook calculates the percentage of banks trading above their 20-day and 50-day moving averages, along with the dispersion of their daily returns. This provides a measure of internal market health.
  • Sentiment Proxy (Experimental): An attempt was made to create a proxy for a "Fear & Greed" index using momentum, volatility, and price strength. While this feature was ultimately excluded to simplify the model, it represents a valid research path.
  • Data Cleaning: After feature generation, which involves rolling windows, the resulting NaN values at the beginning of the dataset were dropped to ensure data integrity for the next stage.
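
The breadth indicators can be computed with plain pandas. The sketch below (column and function names are hypothetical, not the notebook's) shows the two features described above: the share of banks above their 20/50-day moving averages and the cross-sectional dispersion of daily returns:

```python
import numpy as np
import pandas as pd

def breadth_features(bank_closes: pd.DataFrame) -> pd.DataFrame:
    """Market-breadth features from constituent bank closing prices.
    Columns of `bank_closes` are individual banks, indexed by date."""
    feats = pd.DataFrame(index=bank_closes.index)
    for w in (20, 50):
        ma = bank_closes.rolling(w).mean()
        valid = ma.notna().all(axis=1)
        # Share of banks trading above their w-day moving average.
        feats[f"pct_above_ma{w}"] = (bank_closes > ma).mean(axis=1).where(valid)
    # Cross-sectional dispersion of daily returns across the banks.
    feats["banks_stock_dispersion"] = bank_closes.pct_change().std(axis=1)
    return feats

rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=120, freq="B")
closes = pd.DataFrame(rng.lognormal(size=(120, 12)).cumsum(axis=0),
                      columns=[f"BANK{i}" for i in range(12)], index=idx)
feats = breadth_features(closes).dropna()  # drop rolling-window warm-up NaNs
print(feats.columns.tolist())
```

The final `dropna()` mirrors the data-cleaning step above: rows inside the rolling warm-up window are discarded rather than imputed.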

Feature Selection (research/02_feature_selection.ipynb)

With a large number of features, the next step was to identify the most predictive ones to reduce model complexity and noise.

  • Importance Analysis with XGBoost: An XGBRegressor model was trained to rank features based on their contribution. The notebook analyzes multiple importance types (gain, weight, cover) to get a robust understanding of each feature's value.
  • Iterative Selection: Based on the XGBoost results and further experimentation with an LSTM model (tracked via MLflow), a final set of features was selected. The notebook shows this iterative process, with different feature lists being tested.
  • Final Feature Set: The chosen features, such as ["Close", "Low", "Return", "High", "banks_stock_dispersion", "hl_range", "Open"], were saved to feature_final.json. This file is then used in subsequent model training and evaluation steps to ensure consistency.
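
The importance-ranking-then-persist pattern looks roughly like the sketch below. XGBoost is not assumed available here, so a scikit-learn RandomForestRegressor stands in for the XGBRegressor; the data and column subset are synthetic:

```python
import json
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor  # stand-in for XGBRegressor

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 6)),
                 columns=["Close", "Low", "Return", "High", "hl_range", "noise"])
# Target depends mostly on Close and Return, so they should rank highest.
y = 3 * X["Close"] + 2 * X["Return"] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranked = sorted(zip(X.columns, model.feature_importances_),
                key=lambda t: t[1], reverse=True)
top = [name for name, _ in ranked[:4]]

# Persist the selection so training and inference read the same list.
with open("feature_final.json", "w") as f:
    json.dump(top, f)
print(top)
```

Saving the list to `feature_final.json` is the key step: every downstream stage loads this file instead of redefining the feature set by hand.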

Model Tuning with Optuna & MLflow (research/03_model_tuning.ipynb)

With a clean dataset, the search for the optimal LSTM model began. This was conducted systematically using Optuna for hyperparameter optimization, with every experiment tracked by MLflow.

  • Experiment Tracking: All trials were logged under the MLflow experiment "BankSight_Research". This captured parameters, performance metrics, and the model itself for each run, creating a fully reproducible research history.
  • Hyperparameter Search Space: Optuna was configured to explore a wide range of hyperparameters to understand their impact on model performance:
    • SEQ_LENGTH: The number of past days to use for a prediction (5 to 80).
    • lstm_units: The complexity of the LSTM layer (32 to 128).
    • dropout_rate: For regularization (0.1 to 0.5).
    • batch_size, epochs, optimizer, and patience for early stopping.
  • Key Findings: The Optuna logs from the notebook reveal the discovery process. The search ran for 168 trials before being stopped due to computational constraints. A longer, more exploratory search could potentially yield further improvements, but these trials were enough to identify a robust model with a validation Mean Absolute Error (MAE) of approximately 1490, a significant improvement over the initial trials, whose MAEs were well above 3000.

    ...Best is trial 110 with value: 1490.6560546875.
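
The SEQ_LENGTH hyperparameter tuned above determines how the feature matrix is windowed into LSTM inputs. A minimal numpy sketch of that windowing (function name hypothetical):

```python
import numpy as np

def make_sequences(features: np.ndarray, target: np.ndarray, seq_length: int):
    """Slide a window of `seq_length` past days over the feature matrix;
    each window predicts the target value on the day that follows it."""
    X, y = [], []
    for i in range(len(features) - seq_length):
        X.append(features[i:i + seq_length])
        y.append(target[i + seq_length])
    return np.asarray(X), np.asarray(y)

feats = np.random.rand(500, 7)   # 500 days, 7 selected features
close = np.random.rand(500)
X, y = make_sequences(feats, close, seq_length=30)
print(X.shape, y.shape)  # (470, 30, 7) (470,)
```

Because each Optuna trial may propose a different SEQ_LENGTH, the windowing has to be rebuilt inside the objective function, which is part of why longer windows make trials more expensive.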

Robustness Check with Time Series Cross-Validation (research/04_TSCV_evaluation.ipynb)

To ensure the model's performance is stable over time and not just a result of a favorable train-test split, a more rigorous validation strategy was employed.

  • Rolling-Origin Validation: The notebook uses TimeSeriesSplit from scikit-learn to create six training and testing folds. In each fold, the model is trained on past data and tested on a subsequent, unseen period. This simulates how the model would perform as it is periodically retrained and used to predict the future.
  • Consistent Performance: The model was retrained on each fold, and its performance was logged. The Mean Absolute Percentage Error (MAPE) averaged 3.01% across all six folds; excluding the first fold, the average drops to approximately 1.76%. This confirmed that the model's accuracy is not dependent on a single favorable time window.

    Fold 1: MAE=2109.55, MAPE=9.27%
    Fold 2: MAE=561.80,  MAPE=1.87%
    Fold 3: MAE=527.80,  MAPE=1.43%
    Fold 4: MAE=910.41,  MAPE=2.15%
    Fold 5: MAE=1063.22, MAPE=2.18%
    Fold 6: MAE=636.16,  MAPE=1.19%

    Average MAE: 968.16   Average MAPE: 3.01%

  • Backtesting Data Generation: The predictions from each of the cross-validation folds were collected and saved to TSCV_predictions_for_backtest.csv. This file provides a continuous set of out-of-sample predictions, which is essential for an unbiased backtest.
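
The rolling-origin folds come from scikit-learn's TimeSeriesSplit. A small sketch with synthetic row indices shows how each fold trains on everything before its test window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(700).reshape(-1, 1)  # 700 daily rows, in time order

tscv = TimeSeriesSplit(n_splits=6)
folds = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    # Training always ends where the unseen test window begins (rolling origin).
    print(f"Fold {fold}: train=[0..{train_idx[-1]}], "
          f"test=[{test_idx[0]}..{test_idx[-1]}]")
```

Note that early folds train on very little history, which is one plausible reason Fold 1 above performs noticeably worse than the later folds.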

Backtesting & Performance Validation (research/05_Backtest.ipynb)

A low forecast error doesn't always translate to a profitable strategy. The final research step involved backtesting the model's predictions using the vectorbt library to simulate trading performance. This notebook evaluates the model on key financial metrics beyond simple accuracy, such as:

  • Sharpe Ratio & Sortino Ratio
  • Profit Factor
  • Trade Expectancy

This crucial step connects the model's predictive power to real-world strategic value.
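
To make these metrics concrete, here is a hand-rolled sketch of what they measure, computed directly with numpy on hypothetical per-period strategy returns. This is not vectorbt's API, just the underlying arithmetic:

```python
import numpy as np

def strategy_metrics(returns: np.ndarray, periods_per_year: int = 252):
    """Core backtest metrics from a series of per-period strategy returns."""
    wins = returns[returns > 0]
    losses = returns[returns < 0]
    # Sharpe: mean return over total volatility, annualized.
    sharpe = returns.mean() / returns.std() * np.sqrt(periods_per_year)
    # Sortino: like Sharpe, but penalizes only downside volatility.
    sortino = returns.mean() / losses.std() * np.sqrt(periods_per_year)
    # Profit factor: gross gains divided by gross losses.
    profit_factor = wins.sum() / -losses.sum()
    # Expectancy: average amount won (or lost) per period/trade taken.
    expectancy = returns.mean()
    return {"sharpe": sharpe, "sortino": sortino,
            "profit_factor": profit_factor, "expectancy": expectancy}

rng = np.random.default_rng(1)
rets = rng.normal(0.001, 0.01, size=252)  # hypothetical daily strategy returns
m = strategy_metrics(rets)
print({k: round(v, 3) for k, v in m.items()})
```

A model can have a low MAE yet a profit factor below 1 if its errors cluster around trade-entry points, which is exactly why this backtesting step matters.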

2. The MLOps Pipeline: From Research to Reproducibility

The insights gained from research were formalized into a structured, automated pipeline using the components in src/banksight_ml/. This ensures that the model can be retrained on new data in a consistent and reliable manner.

The pipeline is orchestrated by main.py and consists of three main stages:

  1. Data Ingestion: This stage, defined in stage_01_data_processing, automatically downloads the required financial data based on dates specified in params.yaml.
  2. Feature Engineering: The stage_02_feature_eng pipeline processes the raw data, creating the features used for model training.
  3. Model Training: The final stage (stage_03_model_training) takes the engineered features, scales them, and trains the LSTM model using the best hyperparameters discovered during research (which are stored in params.yaml). It then saves the trained model (model.h5) and the necessary scalers (scaler.pkl, scaler_y.pkl) to the artifacts/ directory.

This entire workflow is managed by a central Configuration_Manager that reads from configuration files and provides type-safe dataclass objects (e.g., ModelTrainingConfig) to each pipeline component, minimizing errors and enhancing clarity.
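
The config-manager pattern can be sketched as below. Field names and the params structure are hypothetical (the real values live in params.yaml, parsed here as a plain dict for brevity):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class ModelTrainingConfig:
    """Type-safe bundle of training parameters (field names hypothetical)."""
    seq_length: int
    lstm_units: int
    dropout_rate: float
    model_path: Path

class Configuration_Manager:
    def __init__(self, params: dict):
        # In the real pipeline `params` would be loaded from params.yaml.
        self.params = params

    def get_model_training_config(self) -> ModelTrainingConfig:
        p = self.params["model_training"]
        return ModelTrainingConfig(
            seq_length=int(p["seq_length"]),
            lstm_units=int(p["lstm_units"]),
            dropout_rate=float(p["dropout_rate"]),
            model_path=Path(p["model_path"]),
        )

params = {"model_training": {"seq_length": 30, "lstm_units": 64,
                             "dropout_rate": 0.2, "model_path": "artifacts/model.h5"}}
cfg = Configuration_Manager(params).get_model_training_config()
print(cfg.seq_length, cfg.model_path)  # 30 artifacts/model.h5
```

The frozen dataclass means a typo'd or missing parameter fails loudly at config-construction time rather than deep inside a training run.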

3. The API Layer: Serving the Model

The final piece of the project is the API, located in the BankSight_API/ directory. This lightweight Flask application exposes the trained model for real-time inference.

  • Core Components:
    • app.py: The Flask server that defines the /predict endpoint.
    • inference.py: A helper module that loads the saved model and scalers from the artifacts/ directory and contains the logic to preprocess input data and generate a prediction.
  • Deployment: The API is containerized using the provided Dockerfile for easy deployment. As mentioned, this API is already deployed and in use.
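
The preprocessing that inference.py performs before prediction can be sketched with numpy alone. This is an illustration, not the module's actual code: a min-max transform stands in for the fitted scaler that would be loaded from scaler.pkl, and SEQ_LENGTH is assumed:

```python
import numpy as np

SEQ_LENGTH = 30  # assumed; the real value comes from the tuned hyperparameters

def preprocess(window: np.ndarray, x_min: np.ndarray, x_max: np.ndarray):
    """Min-max scale the most recent SEQ_LENGTH rows and add the batch axis,
    producing the (1, seq_len, n_features) tensor an LSTM expects.
    A real deployment would apply scalers loaded from scaler.pkl instead."""
    window = window[-SEQ_LENGTH:]
    scaled = (window - x_min) / (x_max - x_min)
    return scaled[np.newaxis, ...]

raw = np.random.rand(60, 7) * 1000           # 60 days of 7 features
x = preprocess(raw, raw.min(axis=0), raw.max(axis=0))
print(x.shape)  # (1, 30, 7)
```

The model's output would then be inverse-transformed with the target scaler (scaler_y.pkl) so the /predict endpoint returns a price in index points rather than a scaled value.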
