
Commit 1a84430

docs: add architecture.rst for algorithm rationale, testing, versioning (#1181)
* docs: add architecture.rst for algorithm rationale, testing, and versioning details
* docs: remove manual table of contents from architecture.rst for Furo compatibility and edit content
1 parent 5a1ee0a commit 1a84430

File tree

2 files changed (+49 lines, -0 lines)


_docs/docs/source/architecture.rst

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@

.. _architecture:

Architecture & Design Overview
******************************

This section describes the design rationale, algorithmic choices, assumptions, testing strategy, and contribution process used in the DataProfiler library.

Overview
--------

DataProfiler computes numeric statistics (e.g., mean, variance, skewness, kurtosis) using **streaming algorithms** that allow efficient, incremental updates without recomputing from raw data. Approximate quantile metrics like the median are calculated using histogram-based estimation, making the system scalable for large or streaming datasets.
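
As a concrete illustration of the streaming approach, the sketch below implements Welford's online algorithm for the running mean and variance. This is a plain-Python sketch of the general technique, not DataProfiler's internal implementation:

.. code-block:: python

    class StreamingStats:
        """Track count, mean, and variance incrementally (Welford's algorithm)."""

        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0  # running sum of squared deviations from the mean

        def update(self, x: float) -> None:
            # Each value updates the summary in O(1) time and O(1) memory,
            # so raw data never needs to be stored or re-scanned.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def variance(self) -> float:
            # Sample variance; defined once at least two values are seen.
            return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    stats = StreamingStats()
    for value in [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]:
        stats.update(value)
    print(stats.n, stats.mean, stats.variance)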

Additionally, DataProfiler uses a **Convolutional Neural Network (CNN)** to detect and label entities (e.g., names, emails, credit cards) in unstructured text. This supports critical tasks such as **PII detection**, **schema inference**, and **data quality analysis** across structured and unstructured data.
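
For orientation, the top-level profiler API exercises both the streaming statistics and the CNN-based labeler in one call. A minimal usage sketch (the file name here is a placeholder):

.. code-block:: python

    import dataprofiler as dp

    # Load a dataset; dp.Data auto-detects formats such as CSV, JSON, or text.
    data = dp.Data("my_dataset.csv")  # placeholder path

    # Profiling computes per-column streaming statistics and applies the
    # CNN-based data labeler to flag entities such as emails or credit cards.
    profile = dp.Profiler(data)
    report = profile.report(report_options={"output_format": "compact"})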

Algorithm Rationale
-------------------

The algorithms used are designed for **speed, scalability, and flexibility**:

- **Streaming numeric methods** (e.g., Welford's algorithm, moment-based metrics, histogram binning) efficiently summarize data without full recomputation.
- **CNNs for entity detection** are fast, high-throughput, and well-suited for sequence labeling tasks in production environments.

These choices align with the tool's goal of delivering fast, accurate data profiling with minimal configuration.
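
Because quantiles are estimated from histogram bins rather than a full sort, they trade a small amount of accuracy for constant memory. A simplified sketch of the idea (DataProfiler's own binning strategy differs in detail):

.. code-block:: python

    import numpy as np

    def approx_quantile(counts, bin_edges, q):
        """Estimate the q-quantile from histogram counts via linear interpolation."""
        target = q * counts.sum()                      # rank of the desired quantile
        cumulative = np.cumsum(counts)
        i = int(np.searchsorted(cumulative, target))   # bin holding that rank
        below = cumulative[i - 1] if i > 0 else 0      # count in earlier bins
        fraction = (target - below) / counts[i] if counts[i] else 0.0
        return bin_edges[i] + fraction * (bin_edges[i + 1] - bin_edges[i])

    values = np.random.default_rng(0).normal(size=10_000)
    counts, edges = np.histogram(values, bins=64)
    print(approx_quantile(counts, edges, 0.5))  # close to np.median(values)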

Assumptions & Limitations
-------------------------

- **Consistent formatting** of sensitive entities is assumed (e.g., standardized credit card or SSN formats).
- **Overlapping entity types** (e.g., phone vs. SSN) may lead to misclassification without context.
- **Synthetic training data** may not fully capture real-world diversity, reducing model accuracy on natural or unstructured text.
- **Quantile estimation** (e.g., median) is approximate and based on binning rather than exact sorting.

Testing & Validation
--------------------

- Comprehensive **unit testing** is performed across Python 3.9, 3.10, and 3.11.
- Tests are executed on every pull request targeting the `dev` or `main` branches.
- All pull requests require **two code reviewer approvals** before merging.
- Testing includes correctness, performance, and compatibility checks to ensure production readiness.

Versioning & Contributions
--------------------------

- Versioning and development are managed via **GitHub**.
- Future changes must follow the guidelines in `CONTRIBUTING.md`, including:

  - Forking the repo and branching from `dev` or an active feature branch.
  - Ensuring **80%+ unit test coverage** for all new functionality.
  - Opening a PR and securing **two approvals** prior to merging.

_docs/docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -451,6 +451,7 @@ In addition, it utilizes only the first 10,000 rows.
    profiler.rst
    data_labeling.rst
    graphs.rst
+   architecture.rst

 .. toctree::
    :maxdepth: 2
