.. _architecture:

Architecture & Design Overview
******************************

This section describes the design rationale, algorithmic choices, assumptions, testing strategy, and contribution process used in the DataProfiler library.

Overview
--------

DataProfiler computes numeric statistics (e.g., mean, variance, skewness, kurtosis) using **streaming algorithms** that allow efficient, incremental updates without recomputing from raw data. Approximate quantile metrics such as the median are estimated from histograms, keeping the system scalable for large or streaming datasets.
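
A minimal profiling run might look like the sketch below; the file name is a placeholder, and the exact report contents depend on the data and the selected report options:

.. code-block:: python

    import json

    import dataprofiler as dp

    # Load a dataset; the file type (CSV, JSON, Parquet, text, ...) is auto-detected.
    data = dp.Data("your_file.csv")  # placeholder path

    # Profile it; statistics are accumulated incrementally as rows stream in.
    profile = dp.Profiler(data)

    # The report includes per-column statistics such as mean, variance, and
    # histogram-based quantile estimates.
    report = profile.report(report_options={"output_format": "compact"})
    print(json.dumps(report, indent=4))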

Additionally, DataProfiler uses a **Convolutional Neural Network (CNN)** to detect and label entities (e.g., names, emails, credit cards) in unstructured text. This supports critical tasks such as **PII detection**, **schema inference**, and **data quality analysis** across structured and unstructured data.
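
For example, the pre-trained labeler can be applied directly to free text. The snippet below is a minimal sketch: the sample sentence is invented, and the shape of the prediction output may vary between library versions:

.. code-block:: python

    import dataprofiler as dp

    # Load the default pre-trained labeler for unstructured text.
    labeler = dp.DataLabeler(labeler_type="unstructured")

    # Predict entity labels for a snippet of free text (invented sample).
    text = ["Contact Jane Doe at jane.doe@example.com or 555-123-4567."]
    predictions = labeler.predict(text)
    print(predictions["pred"])  # label output; structure may vary by version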

Algorithm Rationale
-------------------

The algorithms used are designed for **speed, scalability, and flexibility**:

- **Streaming numeric methods** (e.g., Welford's algorithm, moment-based metrics, histogram binning) efficiently summarize data without full recomputation; a sketch of Welford's update appears after this list.
- **CNNs for entity detection** are fast, high-throughput, and well suited to sequence-labeling tasks in production environments.

These choices align with the tool's goal of delivering fast, accurate data profiling with minimal configuration.
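
To make the streaming update concrete, here is a self-contained sketch of Welford's algorithm for the online mean and variance. It is illustrative only, not the library's internal implementation:

.. code-block:: python

    class RunningStats:
        """Welford's online algorithm: one pass, O(1) memory."""

        def __init__(self):
            self.n = 0        # observations seen so far
            self.mean = 0.0   # running mean
            self.m2 = 0.0     # sum of squared deviations from the mean

        def update(self, x):
            # Incremental update: earlier observations never need revisiting.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def variance(self):
            # Sample variance; undefined for fewer than two observations.
            return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    stats = RunningStats()
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(value)
    print(stats.mean, stats.variance)  # 5.0 and ~4.57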

Assumptions & Limitations
-------------------------

- **Consistent formatting** of sensitive entities is assumed (e.g., standardized credit card or SSN formats).
- **Overlapping entity types** (e.g., phone number vs. SSN) may be misclassified without surrounding context.
- **Synthetic training data** may not fully capture real-world diversity, reducing model accuracy on natural or unstructured text.
- **Quantile estimation** (e.g., the median) is approximate, based on histogram binning rather than exact sorting; the sketch after this list illustrates the technique.
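
A hypothetical sketch of histogram-based median estimation is shown below. It interpolates within the bin containing the 50th-percentile rank; this mirrors the general technique, not DataProfiler's exact internals:

.. code-block:: python

    import numpy as np

    def histogram_median(values, bins=100):
        """Approximate the median from a histogram instead of exact sorting."""
        counts, edges = np.histogram(values, bins=bins)
        cumulative = np.cumsum(counts)
        target = cumulative[-1] / 2.0                 # rank of the median
        i = int(np.searchsorted(cumulative, target))  # bin holding that rank
        below = cumulative[i - 1] if i > 0 else 0
        # Linearly interpolate within the bin; error is bounded by bin width.
        fraction = (target - below) / counts[i] if counts[i] else 0.5
        return edges[i] + fraction * (edges[i + 1] - edges[i])

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=10.0, scale=2.0, size=100_000)
    print(histogram_median(sample), np.median(sample))  # close, not identical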

Testing & Validation
--------------------

- Comprehensive **unit testing** is performed across Python 3.9, 3.10, and 3.11.
- Tests are executed on every pull request targeting the ``dev`` or ``main`` branches.
- All pull requests require approval from **two code reviewers** before merging.
- Testing includes correctness, performance, and compatibility checks to ensure production readiness.

Versioning & Contributions
--------------------------

- Versioning and development are managed via **GitHub**.
- Future changes must follow the guidelines in ``CONTRIBUTING.md``, including:

  - Forking the repo and branching from ``dev`` or an active feature branch.
  - Ensuring **80%+ unit test coverage** for all new functionality.
  - Opening a PR and securing **two approvals** prior to merging.