Multi-CPU pdfminer.six emulation is too slow on small documents #3

Closed
opened 2025-01-17 11:29:50 -05:00 by dhdaines · 1 comment
dhdaines commented 2025-01-17 11:29:50 -05:00 (Migrated from github.com)
```console
$ python benchmarks/miner.py -n 1 tests/contrib/PSC_Station.pdf
PAVÉS (1 CPUs) took 0.43s
pdfminer.six (single) took 0.46s
$ python benchmarks/miner.py -n 2 tests/contrib/PSC_Station.pdf
PAVÉS (2 CPUs) took 0.66s
pdfminer.six (single) took 0.45s
```

This is entirely due to the overhead of removing and restoring weak references in indirect object references. The removal step will no longer be necessary as of PLAYA-PDF 0.2.8, but the restoring step is still going to be slow.

Possibly we need to do a custom serializer/deserializer or something? Does ProcessPoolExecutor allow this?
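
For what it's worth, `ProcessPoolExecutor` sends arguments and results through the standard `pickle` machinery, so a class can customize its own serialization with `__reduce__`, and the `initializer`/`initargs` parameters can set up per-worker state. Below is a rough sketch of that idea, not the actual PLAYA-PDF or PAVÉS code: `Doc`, `ObjRef`, `set_current_doc`, `init_worker` and `run_in_pool` are all hypothetical names invented for illustration, and whether this would actually be faster is untested here.

```python
import weakref
from concurrent.futures import ProcessPoolExecutor


class Doc:
    """Hypothetical stand-in for a parsed document."""

    def __init__(self, name):
        self.name = name


# One document per process; workers set this from the pool initializer.
_CURRENT_DOC = None


def set_current_doc(doc):
    global _CURRENT_DOC
    _CURRENT_DOC = doc


def init_worker(name):
    # Assumption: each worker can cheaply re-open the document by name.
    set_current_doc(Doc(name))


def _restore_ref(objid):
    # Called by pickle on the receiving side: reattach to the local document.
    if _CURRENT_DOC is None:
        raise RuntimeError("no document registered in this process")
    return ObjRef(_CURRENT_DOC, objid)


class ObjRef:
    """Hypothetical indirect object reference with a weak link to its document."""

    def __init__(self, doc, objid):
        self.doc = weakref.ref(doc)
        self.objid = objid

    def __reduce__(self):
        # weakref.ref objects cannot be pickled, so send only the object id.
        return (_restore_ref, (self.objid,))


def describe(ref):
    # Runs in a worker; the reference arrives already re-attached.
    return f"{ref.doc().name}:{ref.objid}"


def run_in_pool(name, objids, max_workers=2):
    doc = Doc(name)
    refs = [ObjRef(doc, objid) for objid in objids]
    with ProcessPoolExecutor(
        max_workers=max_workers, initializer=init_worker, initargs=(name,)
    ) as pool:
        return list(pool.map(describe, refs))
```

With this arrangement nothing has to walk the object graph to strip or restore weak references: each reference re-attaches itself to the process-local document as it is unpickled. The cost is that every worker needs its own copy of the document, which is what the initializer is for.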

dhdaines commented 2025-01-22 08:54:03 -05:00 (Migrated from github.com)

Fixed by #4 (why you not close github?!?!?1!6!)

Reference
dhd/paves#3