[BNL - Coralie] - Update BNL importer #174

As discussed with Coralie in this morning's meeting, we have received two new batches of data since the start of impresso 2: one at the end of 2024 and one in autumn 2025.

The goal of this issue is to update the `detect.py` and `classes.py` modules of the `lux` submodule (which corresponds to the BNL data and should maybe be renamed).

More detailed explanations and instructions will follow, especially for the updates to make to `classes.py`. As a first overview, the broad tasks for this issue are:
- [ ] Updates to `detect.py`
  - [x] Identifying and describing in this issue the specific file structure of each batch of data (original in `BNL`, end of 2024 in `BNL-new/impresso`, latest in `BNL-new/2025-impresso-3`)
  - [x] Creating a JSON file listing the path to each issue for each alias and year/month
  - [x] Updating the detect/select functions accordingly
  - [x] Identifying which new data was provided, what data might be missing compared to the [media list](https://docs.google.com/spreadsheets/d/1nPHaTPDfwkpC91P9b6EnKcmJdzfOrysk1yu9i21geSI/edit?gid=1326750059#gid=1326750059), etc.
  - [x] Updating column Q of the list, and updating the [metadata](https://docs.google.com/spreadsheets/d/1jkW6cuINgT7SpuvJE7jVW4lpWiCypVuFiQOhuDZ_o1E/edit?gid=1674451346#gid=1674451346) spreadsheet by creating a new tab for BNL and moving the previous data (until now in the impresso1-collection) into this new tab.
  - [ ] Maybe more tasks based on what is uncovered
- [ ] Updates to `classes.py`
  - [ ] Check on the app whether the article segmentation and logical structure seem correct. If not, add screenshots and URLs here so that we know what to look for and where.
  - [ ] Check the existing issues mentioning BNL data (or checks to do on all data) which might guide the updates to make
    - [x] #130 
    - [ ] #166 
    - [ ] #165 -> check if the images can be linked to their corresponding article
    - [x] #156 -> add facsimile height and width (like in BCUL case)
    - [ ] #143 -> ensure the hyphens are correctly identified
    - [ ] #140 -> check the article titles, and maybe introduce filtering (regex or another approach) to exclude titles that are not real ones
  - [ ] Add the ark ID of issues in the legacy data. Ensure the legacy data meets our needs
  - [ ] Check that the new versions of the data still work with the existing importer
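
For reference, the JSON index from the `detect.py` tasks (path to each issue, per alias and year/month) could be built roughly as sketched below. This is only an illustration, not the code in the `lux` submodule: the function names `build_issue_index` and `write_issue_index` are hypothetical, and the directory-name pattern `<id>_newspaper_<alias>_<YYYY>-<MM>-<DD>` is an assumption that would need to be adapted to the structure actually found in each batch.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# ASSUMPTION: issue directories are named "<id>_newspaper_<alias>_<YYYY>-<MM>-<DD>".
# This pattern is illustrative only; adjust it per batch.
ISSUE_DIR_RE = re.compile(
    r"^\d+_newspaper_(?P<alias>[^_]+)_(?P<year>\d{4})-(?P<month>\d{2})-\d{2}"
)


def build_issue_index(batch_roots):
    """Map alias -> "YYYY-MM" -> sorted list of issue directory paths."""
    index = defaultdict(lambda: defaultdict(list))
    for root in batch_roots:
        # Walk every sub-directory of the batch, whatever its nesting depth.
        for path in sorted(Path(root).rglob("*")):
            if not path.is_dir():
                continue
            match = ISSUE_DIR_RE.match(path.name)
            if match is None:
                continue  # not an issue directory under the assumed pattern
            year_month = f"{match['year']}-{match['month']}"
            index[match["alias"]][year_month].append(str(path))
    # Convert to plain dicts so the result is JSON-serialisable as-is.
    return {alias: dict(months) for alias, months in index.items()}


def write_issue_index(index, out_file):
    """Dump the index to a JSON file with stable key ordering."""
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=2, sort_keys=True)
```

Calling `build_issue_index` with the three batch roots (`BNL`, `BNL-new/impresso`, `BNL-new/2025-impresso-3`) would give a single lookup that the detect/select functions can filter by alias and year/month, independently of how each batch nests its directories.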

If needed, I'll add more info on each of these tasks, but in general they tie back quite closely to the work done for BCUL.

