
[Glebs] - SUB Canonical Importer #173

@piconti

Description


This issue describes the tasks and steps involved in creating a new SUB importer in the text_preparation.importer module.
The idea is to delegate this task to @maslionok, as it is quite self-contained.

General description of our needs

We would need to implement the classes which import SUB's OCR/OLR data into our canonical format on S3.
This is done by the text_preparation.importer module in two main parts:

  • Classes and detect functions, in the classes.py and detect.py scripts of the importers.sub module:
    • classes.py defines the Canonical Issue and Canonical Page classes for the SUB case, taking into account all the specificities of this exact version of METS/ALTO OCR/OLR.
    • detect.py defines the functions used to identify all the issues to ingest from the file system, using "IssueDir" objects: named tuples holding the main information identifying an issue.
    • Both scripts are very similar from one importer to the next, and the code of other METS/ALTO-based importers can (and should!) be used as a guide and example, especially BL, BNF, and BNF-EN.
  • Orchestrators generic_importer.py and core.py:
    • These are the main orchestrators of the imports. They should not be modified much, as they work the same way for all importers. Reading them helps in understanding how everything fits together (e.g. how issues are serialized first, the lazy behavior of page objects, error logging, etc.). There may be errors in these scripts, but it is best to come to me about them, as they impact all canonical imports.
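To make the detect.py side concrete, here is a minimal sketch of what issue detection over the SUB directory layout could look like. The IssueDir tuple here is only a stand-in for the real one in text_preparation (its actual field names may differ), and the function name and ordering logic are mine; in particular, sorting edition directories alphabetically is a simplification, since the real importer will likely need title-specific rules (e.g. morning before evening).

```python
from collections import namedtuple
import os
import string

# Illustrative stand-in for the "IssueDir" named tuple described above;
# the real definition in text_preparation may use different field names.
IssueDir = namedtuple("IssueDir", ["alias", "year", "month", "day", "edition", "path"])


def detect_issues(base_dir: str, alias: str = "hamb_echo") -> list:
    """Walk a Title/YYYY/MM/DD/<edition>/ tree and emit one IssueDir per edition.

    Edition directories under the same day are sorted and mapped to the
    letters 'a', 'b', ... (alphabetical order is a placeholder for the
    real edition-ordering rules).
    """
    issues = []
    for year in sorted(os.listdir(base_dir)):
        year_path = os.path.join(base_dir, year)
        if not (year.isdigit() and os.path.isdir(year_path)):
            continue
        for month in sorted(os.listdir(year_path)):
            month_path = os.path.join(year_path, month)
            if not os.path.isdir(month_path):
                continue
            for day in sorted(os.listdir(month_path)):
                day_path = os.path.join(month_path, day)
                if not os.path.isdir(day_path):
                    continue
                editions = sorted(
                    d for d in os.listdir(day_path)
                    if os.path.isdir(os.path.join(day_path, d))
                )
                for letter, ed_dir in zip(string.ascii_lowercase, editions):
                    issues.append(
                        IssueDir(alias, year, month, day, letter,
                                 os.path.join(day_path, ed_dir))
                    )
    return issues
```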

Much more precise information can be found in the project documentation, but don't hesitate to reach out and ask if anything is unclear.

Here are the JSON schemas the pages and issues follow.

Current Situation and File Structure

The current situation is the following: we have received from the SUB the full contents of the "Hamburger Echo" newspaper title, spanning 1887 to 1933, in METS/ALTO format.

The structure of the files is the following:

Hamburger_Echo/                          # Root directory (newspaper title)
├── 1919/                                # Year directory
│   ├── 02/                              # Month directory (February)
│   │   ├── 19/                          # Day directory (19th)
│   │   │   ├── Morgenausgabe/           # Morning edition
│   │   │   │   ├── 00000001.tif         # Page image (facsimile)
│   │   │   │   ├── 00000001.xml         # Page OCR (ALTO or PAGE XML)
│   │   │   │   ├── 00000002.tif
│   │   │   │   ├── 00000002.xml
│   │   │   │   ├── [...]                # More page pairs (.tif/.xml)
│   │   │   │   └── PPN1754726119_19190219MO.xml   # METS file (Morgenausgabe, “MO”)
│   │   │   │                                      # Format: PPN[titleID]_[YYYYMMDD][edition_code].xml
│   │   │   │                                      # Here: PPN1754726119 = newspaper ID
│   │   │   │                                      #        19190219 = date in YYYYMMDD format
│   │   │   │                                      #        MO = morning edition (Morgenausgabe)
│   │   │   │
│   │   │   ├── Abendausgabe/            # Evening edition (same day)
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]                # More pages
│   │   │   │   └── PPN1754726119_19190219AB.xml   # METS file (Abendausgabe, “AB”)
│   │   │   │
│   │   │   └── [other editions or none for this day...]
│   │   │
│   │   ├── 20/                          # Next day (example of multiple evening editions)
│   │   │   ├── A1-Abendausgabe/
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]
│   │   │   │   └── PPN1754726119_19190220A1.xml   # METS file (first evening edition)
│   │   │   │
│   │   │   ├── A2-Abendausgabe/
│   │   │   │   ├── 00000001.tif
│   │   │   │   ├── 00000001.xml
│   │   │   │   ├── [...]
│   │   │   │   └── PPN1754726119_19190220A2.xml   # METS file (second evening edition)
│   │   │
│   │   ├── 21/                          # Example of a single daily edition
│   │   │   └── Ausgabe/
│   │   │       ├── 00000001.tif
│   │   │       ├── 00000001.xml
│   │   │       ├── [...]
│   │   │       └── PPN1754726119_19190221.xml     # METS file (single daily edition)
│   │   │
│   │   └── [other days...]
│   │
│   └── [other months...]
│
└── [other years...]
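Since the METS filename encodes the title ID, the date, and an optional edition code, a small helper can pull these apart. A sketch assuming only the pattern shown in the tree above (the regex and the helper name are mine, not part of the codebase):

```python
import re

# Matches PPN[titleID]_[YYYYMMDD][edition_code].xml, where the edition code
# (MO, AB, A1, A2, ...) may be absent on single-edition days.
METS_RE = re.compile(r"^PPN(?P<ppn>\d+)_(?P<date>\d{8})(?P<code>[A-Z]\w?)?\.xml$")


def parse_mets_filename(name: str):
    """Return the parts of a SUB METS filename, or None if it does not match."""
    m = METS_RE.match(name)
    if m is None:
        return None
    d = m.group("date")
    return {
        "ppn": m.group("ppn"),
        "year": d[:4],
        "month": d[4:6],
        "day": d[6:8],
        "edition_code": m.group("code"),  # None when no code is present
    }
```

This is only a starting point; if other titles use longer edition codes, the `code` group would need to be widened accordingly.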

Each level represents:

  • Title → Hamburger_Echo (newspaper)
  • Year → e.g. 1919
  • Month → e.g. 02
  • Day → e.g. 19, 20, 21
  • Edition → one of:
    • Ausgabe → single daily edition (it seems)
    • Morgenausgabe → morning edition (MO)
    • Abendausgabe → evening edition (AB)
    • A1-Abendausgabe, A2-Abendausgabe → multiple evening editions (A1, A2); the same can probably also exist for morning editions
      -> a morning or evening edition can also be the only edition of a given day. From what I have seen, whenever we have "Ausgabe" it is always the only edition of the day, but this may not hold for the entirety of the data.

This structure can be directly and easily used to construct the Impresso IDs of each issue (and page):
[media-alias]-[YYYY]-[MM]-[DD]-[edition letter(s)] (and [media-alias]-[YYYY]-[MM]-[DD]-[edition letter(s)]-p[page number zero-filled to 4 digits] for pages).
--> Note that editions are simply assigned a letter: 'a' for the first, 'b' for the second, etc.

Here we will assign the alias "hamb_echo" to the Hamburger Echo, and decide later what to assign to other titles.
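The ID scheme above is mechanical enough to express in two one-liners. The helper names are mine; only the format itself comes from the description above:

```python
def issue_id(alias: str, year: str, month: str, day: str, edition: str) -> str:
    """Canonical issue ID: [media-alias]-[YYYY]-[MM]-[DD]-[edition letter]."""
    return f"{alias}-{year}-{month}-{day}-{edition}"


def page_id(issue: str, page_number: int) -> str:
    """Canonical page ID: the issue ID plus -p and the page number zero-filled to 4 digits."""
    return f"{issue}-p{page_number:04d}"
```

For example, page 1 of the first edition of 19 February 1919 would be hamb_echo-1919-02-19-a-p0001.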

More precise tasks and questions left for discussion

I have already added some skeletons for the classes and scripts to implement in the feature/sub-importer branch. The main tasks are:

  • defining the SubNewspaperIssue object
  • defining the SubNewspaperPage object
  • defining the functions in the detect.py script
  • modularization of any helper function in a helpers.py script
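As a very rough sketch of the shape those skeletons might take (the class names come from the list above, but everything else here is a placeholder: in the real branch both classes should subclass the METS/ALTO base classes of text_preparation, as the BL/BNF importers do, so the attributes and method signatures below are assumptions, not the actual API):

```python
# Placeholder skeletons only; the real classes subclass the METS/ALTO
# base classes of text_preparation, and their signatures will differ.

class SubNewspaperPage:
    """One page of a SUB issue, backed by an ALTO file parsed lazily."""

    def __init__(self, page_id: str, number: int, alto_path: str):
        self.id = page_id            # e.g. "hamb_echo-1919-02-19-a-p0001"
        self.number = number
        self.alto_path = alto_path   # the 0000000N.xml file next to the METS file
        self.data = None             # filled in lazily by parse()

    def parse(self) -> None:
        # Read the ALTO file and convert it to the canonical page schema.
        raise NotImplementedError


class SubNewspaperIssue:
    """One issue (one edition of one day), built from an IssueDir and its METS file."""

    def __init__(self, issue_id: str, mets_path: str):
        self.id = issue_id           # e.g. "hamb_echo-1919-02-19-a"
        self.mets_path = mets_path
        self.pages = []              # SubNewspaperPage objects

    def _find_pages(self) -> None:
        # Enumerate the .xml page files in the issue directory and
        # instantiate one SubNewspaperPage per page.
        raise NotImplementedError
```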
