As discussed with Coralie in this morning's meeting, we have received two new batches of data since the start of impresso 2: one at the end of 2024 and one in autumn 2025.
The goal of this issue is to update the `detect.py` and `classes.py` modules of the lux submodule (which corresponds to the BNL data and should perhaps be renamed).
More detailed explanations and instructions will follow, especially for the updates to make in `classes.py`, but here is a first overview.
As part of this issue, the broad tasks are:
- Updates to `detect.py`
  - Identifying and describing in this issue the specific file structure of each batch of data (original in `BNL`, end 2024 in `BNL-new/impresso`, latest in `BNL-new/2025-impresso-3`)
  - Creating a JSON file listing the path to each issue for each alias and year/month
  - Updating the detect/select functions accordingly
  - Identifying which new data was provided, what data might be missing compared to the media list, etc.
  - Updating column Q of the list, and updating the metadata spreadsheet by creating a new tab for BNL and moving the previous data (until now in the impresso1-collection) into this new tab
  - Possibly more tasks, depending on what is uncovered
- Updates to `classes.py`
  - Check in the app whether the article segmentation and logical structure seem correct; if not, add screenshots and URLs here so that we know what to look for and where
  - Check the existing issues mentioning BNL data (or checks to run on all data), which might guide the updates to make:
    - [BNL - Lux importer] Investigate and fix the logical matching of physical articles and content-items #130
    - BNL hierarchical structure in sub-articles and our "flat" title /content model leaves main sections out; need for documentation #166
    - Image - Article linking in canonical data #165 -> check if the images can be linked to their corresponding article
    - [Canonical Pages] - Patch existing data to add fascimile height and width #156 -> add facsimile height and width (as in the BCUL case)
    - [canonical - rebuilt] Hyphenation - état des lieux and improvement #143 -> ensure the hyphens are correctly identified
    - Filtering UNKNOWN/UNTITLED as markers for non-existing titles #140 -> check the article titles, and maybe introduce a filter (regex or another approach) so that titles are not included if they're not real ones
  - Add the ark ID of issues in the legacy data. Ensure the legacy data meets our needs
  - Check that the new versions of the data still work with the existing importer
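
For the JSON file listing the path to each issue per alias and year/month, here is a minimal sketch of one possible shape. It only covers the grouping step: how the alias, date and path are extracted from each batch's file structure is deliberately left out, since that structure still has to be described. The function name `build_issue_index`, the aliases and the file names are all hypothetical; only the three root folders come from this issue.

```python
import json
from collections import defaultdict
from datetime import date


def build_issue_index(records):
    """Group issue paths by alias, then by "YYYY/MM".

    `records` is an iterable of (alias, issue_date, path) tuples. Producing
    these tuples from each batch is out of scope for this sketch.
    """
    index = defaultdict(lambda: defaultdict(list))
    for alias, issue_date, path in records:
        year_month = f"{issue_date.year:04d}/{issue_date.month:02d}"
        index[alias][year_month].append(path)
    # Convert to plain, sorted dicts so the JSON output is stable across runs.
    return {
        alias: {ym: sorted(paths) for ym, paths in sorted(months.items())}
        for alias, months in sorted(index.items())
    }


# Hypothetical records: the aliases, dates and file names are made up.
example_records = [
    ("alias_a", date(1900, 1, 2), "BNL/alias_a_1900-01-02"),
    ("alias_a", date(1900, 1, 3), "BNL-new/impresso/alias_a_1900-01-03"),
    ("alias_b", date(1950, 6, 1), "BNL-new/2025-impresso-3/alias_b_1950-06-01"),
]

index_json = json.dumps(build_issue_index(example_records), indent=2)
```

Keeping the batch root in each stored path would let the detect/select functions tell which delivery an issue comes from without a separate lookup.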
If needed, I'll add more info on each of these tasks, but in general it ties back quite closely to the work done for BCUL.
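
For the UNKNOWN/UNTITLED title check (#140), a regex-based filter could look like the sketch below. This is only an assumption about how the placeholders appear in the data (the whole title being one of the two words, in any case, possibly wrapped in punctuation); the names `PLACEHOLDER_TITLE` and `clean_title` are hypothetical and not part of the existing importer.

```python
import re

# Matches titles made up only of a placeholder word, ignoring case and any
# surrounding punctuation or whitespace (e.g. "UNKNOWN", "[Untitled]").
PLACEHOLDER_TITLE = re.compile(r"^\W*(unknown|untitled)\W*$", re.IGNORECASE)


def clean_title(title):
    """Return the stripped title, or None if it is empty or a placeholder."""
    if title is None:
        return None
    stripped = title.strip()
    if not stripped or PLACEHOLDER_TITLE.match(stripped):
        return None
    return stripped
```

Anchoring the pattern on the whole string means a real title that merely contains the word "unknown" is kept.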