.. _architecture:

Architecture & Design Overview
******************************

This section describes the design rationale, algorithmic choices, assumptions, testing strategy, and contribution process used in the DataProfiler library.

Overview
--------

DataProfiler computes numeric statistics (e.g., mean, variance, skewness, kurtosis) using **streaming algorithms** that allow efficient, incremental updates without recomputing from raw data. Approximate quantile metrics such as the median are estimated from histograms, keeping the system scalable for large or streaming datasets.
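
A minimal profiling run might look like the sketch below; the file name is a placeholder, and the exact report contents depend on the data and the selected report options:

.. code-block:: python

    import json

    import dataprofiler as dp

    # Load a dataset; the file type (CSV, JSON, Parquet, text, ...) is auto-detected.
    data = dp.Data("your_file.csv")  # placeholder path

    # Profile it; statistics are accumulated incrementally as rows stream in.
    profile = dp.Profiler(data)

    # The report includes per-column statistics such as mean, variance, and
    # histogram-based quantile estimates.
    report = profile.report(report_options={"output_format": "compact"})
    print(json.dumps(report, indent=4))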

Additionally, DataProfiler uses a **Convolutional Neural Network (CNN)** to detect and label entities (e.g., names, emails, credit cards) in unstructured text. This supports critical tasks such as **PII detection**, **schema inference**, and **data quality analysis** across structured and unstructured data.
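
For example, the pre-trained labeler can be applied directly to free text. The snippet below is a minimal sketch: the sample sentence is invented, and the shape of the prediction output may vary between library versions:

.. code-block:: python

    import dataprofiler as dp

    # Load the default pre-trained labeler for unstructured text.
    labeler = dp.DataLabeler(labeler_type="unstructured")

    # Predict entity labels for a snippet of free text (invented sample).
    text = ["Contact Jane Doe at jane.doe@example.com or 555-123-4567."]
    predictions = labeler.predict(text)
    print(predictions["pred"])  # label output; structure may vary by version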

Algorithm Rationale
-------------------

The algorithms used are designed for **speed, scalability, and flexibility**:

- **Streaming numeric methods** (e.g., Welford's algorithm, moment-based metrics, histogram binning) efficiently summarize data without full recomputation; a sketch of Welford's update appears after this list.
- **CNNs for entity detection** are fast, high-throughput, and well suited to sequence-labeling tasks in production environments.

These choices align with the tool's goal of delivering fast, accurate data profiling with minimal configuration.
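
To make the streaming update concrete, here is a self-contained sketch of Welford's algorithm for the online mean and variance. It is illustrative only, not the library's internal implementation:

.. code-block:: python

    class RunningStats:
        """Welford's online algorithm: one pass, O(1) memory."""

        def __init__(self):
            self.n = 0        # observations seen so far
            self.mean = 0.0   # running mean
            self.m2 = 0.0     # sum of squared deviations from the mean

        def update(self, x):
            # Incremental update: earlier observations never need revisiting.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def variance(self):
            # Sample variance; undefined for fewer than two observations.
            return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    stats = RunningStats()
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(value)
    print(stats.mean, stats.variance)  # 5.0 and ~4.57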

Assumptions & Limitations
-------------------------

- **Consistent formatting** of sensitive entities is assumed (e.g., standardized credit card or SSN formats).
- **Overlapping entity types** (e.g., phone number vs. SSN) may be misclassified without surrounding context.
- **Synthetic training data** may not fully capture real-world diversity, reducing model accuracy on natural or unstructured text.
- **Quantile estimation** (e.g., the median) is approximate, based on histogram binning rather than exact sorting; the sketch after this list illustrates the technique.
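
A hypothetical sketch of histogram-based median estimation is shown below. It interpolates within the bin containing the 50th-percentile rank; this mirrors the general technique, not DataProfiler's exact internals:

.. code-block:: python

    import numpy as np

    def histogram_median(values, bins=100):
        """Approximate the median from a histogram instead of exact sorting."""
        counts, edges = np.histogram(values, bins=bins)
        cumulative = np.cumsum(counts)
        target = cumulative[-1] / 2.0                 # rank of the median
        i = int(np.searchsorted(cumulative, target))  # bin holding that rank
        below = cumulative[i - 1] if i > 0 else 0
        # Linearly interpolate within the bin; error is bounded by bin width.
        fraction = (target - below) / counts[i] if counts[i] else 0.5
        return edges[i] + fraction * (edges[i + 1] - edges[i])

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=10.0, scale=2.0, size=100_000)
    print(histogram_median(sample), np.median(sample))  # close, not identical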

Testing & Validation
--------------------

- Comprehensive **unit testing** is performed across Python 3.9, 3.10, and 3.11.
- Tests are executed on every pull request targeting the ``dev`` or ``main`` branches.
- All pull requests require approval from **two code reviewers** before merging.
- Testing includes correctness, performance, and compatibility checks to ensure production readiness.

Versioning & Contributions
--------------------------

- Versioning and development are managed via **GitHub**.
- Future changes must follow the guidelines in ``CONTRIBUTING.md``, including:

  - Forking the repo and branching from ``dev`` or an active feature branch.
  - Ensuring **80%+ unit test coverage** for all new functionality.
  - Opening a PR and securing **two approvals** prior to merging.