
[Glebs] - SUB Canonical Importer #173

@piconti

Description


This issue describes the tasks and steps involved in creating a new SUB importer in the text_preparation.importer module.
The idea is to delegate this task to @maslionok, as it is quite self-contained.

General description of our needs

We would need to implement the classes which import SUB's OCR/OLR data into our canonical format on S3.
This is done by the text_preparation.importer module in two main parts:

  • Classes and detect functions, in the classes.py and detect.py scripts of the importers.sub module:
    • classes.py defines the Canonical Issue and Canonical Page classes for the SUB case, taking into account all the specificities of this exact version of METS/ALTO OCR/OLR.
    • detect.py defines the functions used to identify all the issues to ingest from the file system, using "IssueDir" objects: named tuples holding the main information identifying an issue.
    • Both scripts are very similar from one importer to the next, and the code of other METS/ALTO-based importers can (and should!) be used as a guide and example, especially BL, BNF, and BNF-EN.
  • Orchestrators generic_importer.py and core.py:
    • These are the main orchestrators of the imports. They should not be modified much, as they work the same way for all importers. Reading them helps in understanding how everything fits together (e.g. how issues are serialized first, the lazy behavior of page objects, error logging, etc.). There may be errors in these scripts, but it is best to come to me about them, as they impact all canonical imports.
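To make the detect.py side concrete, here is a minimal sketch of what issue detection over the SUB directory layout could look like. The IssueDir tuple here is only a stand-in for the real one in text_preparation (its actual field names may differ), and the function name and ordering logic are mine; in particular, sorting edition directories alphabetically is a simplification, since the real importer will likely need title-specific rules (e.g. morning before evening).

```python
from collections import namedtuple
import os
import string

# Illustrative stand-in for the "IssueDir" named tuple described above;
# the real definition in text_preparation may use different field names.
IssueDir = namedtuple("IssueDir", ["alias", "year", "month", "day", "edition", "path"])


def detect_issues(base_dir: str, alias: str = "hamb_echo") -> list:
    """Walk a Title/YYYY/MM/DD/<edition>/ tree and emit one IssueDir per edition.

    Edition directories under the same day are sorted and mapped to the
    letters 'a', 'b', ... (alphabetical order is a placeholder for the
    real edition-ordering rules).
    """
    issues = []
    for year in sorted(os.listdir(base_dir)):
        year_path = os.path.join(base_dir, year)
        if not (year.isdigit() and os.path.isdir(year_path)):
            continue
        for month in sorted(os.listdir(year_path)):
            month_path = os.path.join(year_path, month)
            if not os.path.isdir(month_path):
                continue
            for day in sorted(os.listdir(month_path)):
                day_path = os.path.join(month_path, day)
                if not os.path.isdir(day_path):
                    continue
                editions = sorted(
                    d for d in os.listdir(day_path)
                    if os.path.isdir(os.path.join(day_path, d))
                )
                for letter, ed_dir in zip(string.ascii_lowercase, editions):
                    issues.append(
                        IssueDir(alias, year, month, day, letter,
                                 os.path.join(day_path, ed_dir))
                    )
    return issues
```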

Much more precise information can be found in the project documentation, but don't hesitate to reach out and ask if anything is unclear.

Here are the JSON schemas the pages and issues follow.

Current Situation and File Structure

The current situation is the following: we have received from the SUB the full contents of the "Hamburger Echo" newspaper title, spanning 1887 to 1933, in METS/ALTO format.

The structure of the files is the following:

Hamburger_Echo/                          # Root directory (newspaper title)
├── 1919/                                # Year directory
│   ├── 02/                              # Month directory (February)
│   │   ├── 19/                          # Day directory (19th)
│   │   │   ├── Morgenausgabe/           # Morning edition
│   │   │   │   ├── 00000001.tif         # Page image (facsimile)
│   │   │   │   ├── 00000001.xml         # Page OCR (ALTO or PAGE XML)
│   │   │   │   ├── 00000002.tif
│   │   │   │   ├── 00000002.xml
│   │   │   │   ├── [...]                # More page pairs (.tif/.xml)
│   │   │   │   └── PPN1754726119_19190219MO.xml   # METS file (Morgenausgabe, “MO”)
│   │   │   │                                      # Format: PPN[titleID]_[YYYYMMDD][edition_code].xml
│   │   │   │                                      # Here: PPN1754726119 = newspaper ID
│   │   │   │                                      #        19190219 = date in YYYYMMDD format
│   │   │   │                                      #        MO = morning edition (Morgenausgabe)
│   │   │   │
│   │   │   ├── Abendausgabe/            # Evening edition (same day)
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]                # More pages
│   │   │   │   └── PPN1754726119_19190219AB.xml   # METS file (Abendausgabe, “AB”)
│   │   │   │
│   │   │   └── [other editions or none for this day...]
│   │   │
│   │   ├── 20/                          # Next day (example of multiple evening editions)
│   │   │   ├── A1-Abendausgabe/
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]
│   │   │   │   └── PPN1754726119_19190220A1.xml   # METS file (first evening edition)
│   │   │   │
│   │   │   ├── A2-Abendausgabe/
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]
│   │   │   │   └── PPN1754726119_19190220A2.xml   # METS file (second evening edition)
│   │   │
│   │   ├── 21/                          # Example of a single daily edition
│   │   │   └── Ausgabe/
│   │   │       ├── 00000001.tif
│   │   │       ├── 00000001.xml
│   │   │       ├── [...]
│   │   │       └── PPN1754726119_19190221.xml     # METS file (single daily edition)
│   │   │
│   │   └── [other days...]
│   │
│   └── [other months...]
│
└── [other years...]
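Since the METS filename encodes the title ID, the date, and an optional edition code, a small helper can pull these apart. A sketch assuming only the pattern shown in the tree above (the regex and the helper name are mine, not part of the codebase):

```python
import re

# Matches PPN[titleID]_[YYYYMMDD][edition_code].xml, where the edition code
# (MO, AB, A1, A2, ...) may be absent on single-edition days.
METS_RE = re.compile(r"^PPN(?P<ppn>\d+)_(?P<date>\d{8})(?P<code>[A-Z]\w?)?\.xml$")


def parse_mets_filename(name: str):
    """Return the parts of a SUB METS filename, or None if it does not match."""
    m = METS_RE.match(name)
    if m is None:
        return None
    d = m.group("date")
    return {
        "ppn": m.group("ppn"),
        "year": d[:4],
        "month": d[4:6],
        "day": d[6:8],
        "edition_code": m.group("code"),  # None when no code is present
    }
```

This is only a starting point; if other titles use longer edition codes, the `code` group would need to be widened accordingly.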

Each level represents:

  • Title → Hamburger_Echo (newspaper)
  • Year → e.g. 1919
  • Month → e.g. 02
  • Day → e.g. 19, 20, 21
  • Edition → one of:
    • Ausgabe → single daily edition (it seems)
    • Morgenausgabe → morning edition (MO)
    • Abendausgabe → evening edition (AB)
    • A1-Abendausgabe, A2-Abendausgabe → multiple evening editions (A1, A2); the same can probably also exist for morning editions
      -> a morning or evening edition can also be the only edition of a given day. From what I have seen, whenever we have "Ausgabe" it is always the only edition of the day, but this may not hold for the entirety of the data.

This structure can be directly and easily used to construct the Impresso IDs of each issue (and page):
[media-alias]-[YYYY]-[MM]-[DD]-[edition letter(s)] (and [media-alias]-[YYYY]-[MM]-[DD]-[edition letter(s)]-p[page number zero-filled to 4 digits] for pages).
--> Note that editions are simply assigned a letter: 'a' for the first, 'b' for the second, etc.

Here we will assign the alias "hamb_echo" to the Hamburger Echo, and decide later what to assign to other titles.
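The ID scheme above is mechanical enough to express in two one-liners. The helper names are mine; only the format itself comes from the description above:

```python
def issue_id(alias: str, year: str, month: str, day: str, edition: str) -> str:
    """Canonical issue ID: [media-alias]-[YYYY]-[MM]-[DD]-[edition letter]."""
    return f"{alias}-{year}-{month}-{day}-{edition}"


def page_id(issue: str, page_number: int) -> str:
    """Canonical page ID: the issue ID plus -p and the page number zero-filled to 4 digits."""
    return f"{issue}-p{page_number:04d}"
```

For example, page 1 of the first edition of 19 February 1919 would be hamb_echo-1919-02-19-a-p0001.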

More precise tasks and questions left for discussion

I have already added some skeletons for the classes and scripts to implement in the feature/sub-importer branch. The main tasks are:

  • defining the SubNewspaperIssue object
  • defining the SubNewspaperPage object
  • defining the functions in the detect.py script
  • modularization of any helper function in a helpers.py script
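As a very rough sketch of the shape those skeletons might take (the class names come from the list above, but everything else here is a placeholder: in the real branch both classes should subclass the METS/ALTO base classes of text_preparation, as the BL/BNF importers do, so the attributes and method signatures below are assumptions, not the actual API):

```python
# Placeholder skeletons only; the real classes subclass the METS/ALTO
# base classes of text_preparation, and their signatures will differ.

class SubNewspaperPage:
    """One page of a SUB issue, backed by an ALTO file parsed lazily."""

    def __init__(self, page_id: str, number: int, alto_path: str):
        self.id = page_id            # e.g. "hamb_echo-1919-02-19-a-p0001"
        self.number = number
        self.alto_path = alto_path   # the 0000000N.xml file next to the METS file
        self.data = None             # filled in lazily by parse()

    def parse(self) -> None:
        # Read the ALTO file and convert it to the canonical page schema.
        raise NotImplementedError


class SubNewspaperIssue:
    """One issue (one edition of one day), built from an IssueDir and its METS file."""

    def __init__(self, issue_id: str, mets_path: str):
        self.id = issue_id           # e.g. "hamb_echo-1919-02-19-a"
        self.mets_path = mets_path
        self.pages = []              # SubNewspaperPage objects

    def _find_pages(self) -> None:
        # Enumerate the .xml page files in the issue directory and
        # instantiate one SubNewspaperPage per page.
        raise NotImplementedError
```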
