-
-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove usage of os.path and path.py, prefer pathlib #195
Comments
Hey, I would like to take up this issue. |
👍 |
I have opened a Pull request #208 to solve the issue. Kindly go through it. Thanks! |
Is this issue still open? I would like to work on it. |
Yes, still open, I assigned you the issue. I had to close previous PR because code was not working and committer vanished. You can probably get started with the code in this PR and fixed what had to be. Important is to execute on full run of the scraper to ensure it is working properly. I mentioned the command I used to test in the PR. |
Should I also replace os.unlink with unlink method from Path? I've looked over 3-4 modules so far and it seems to be the only place we are still using os . |
Yes, this would make sense. |
I've wrote most of the changes however I'm having troubles testing the script. I've used python However I made a clean WSL env on debian and cloned the main fork again to see how it's supposed to work normally but I also have problems there: And at this point I'm concerned I'm not even able to properly set up de dev env... [gutenberg2zim.constants::2024-03-07 23:10:38,587] INFO:CHECKING for dependencies on the system
[gutenberg2zim.constants::2024-03-07 23:10:38,592] INFO:PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO: rdf-files archive already exists in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:SETTING UP DATABASE
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:Setting up the database
[gutenberg2zim.constants::2024-03-07 23:10:38,593] DEBUG:license table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:author table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:book table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:bookformat table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] DEBUG:url table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO:PARSING rdf-files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO: Looping throught RDF files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:58,405] INFO: Skipping already parsed file cache/epub/84/pg84.rdf for book id 84
[gutenberg2zim.constants::2024-03-07 23:10:58,897] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO: Urls rsync result tmp/file_on_aleph_pglaf_org already exists, processing existing file
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO: Looking after relative path start in urls rsync result
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/mnt/d/python/test/gutenberg/src/gutenberg2zim/__main__.py", line 4, in <module>
main()
File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/entrypoint.py", line 181, in main
setup_urls(force=force)
File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/urls.py", line 256, in setup_urls
if start_rel_path_idx == -1: # type: ignore
^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'start_rel_path_idx' where it is not associated with a value I even tried to make a docker image but it i had to ctrl+c it after about 1h docker run my-image gutenberg2zim --books 84
[gutenberg2zim.constants::2024-03-07 21:16:20,934] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 21:16:20,934] DEBUG:bash -c rsync -a --list-only rsync://aleph.pglaf.org/gutenberg/ > tmp/file_on_aleph_pglaf_org
[gutenberg2zim.constants::2024-03-07 21:36:15,318] INFO: Looking after relative path start in urls rsync result
[gutenberg2zim.constants::2024-03-07 21:36:16,023] INFO: Removing all urls already present in DB
[gutenberg2zim.constants::2024-03-07 21:36:16,027] INFO: Appending urls in DB from rsync result |
I would like to fix this @benoit74 |
@elfkuzco please do ^^ |
Will probably take a bit longer than I expected because in some places, the code expects a string and passing an instance of Path (from pathlib) breaks things when I try to test. My idea is to add type annotations to it too rather than resolving the paths everywhere they are called. |
Makes sense |
@benoit74 , I am almost done with this fix and I have tested the command However, before I run more tests and submit a PR, I would like to point out that the usage of the For example,
and
both achieve the same thing. But using What do you think? |
Great! Your point is important, thank you for asking. This is in fact even potentially a bug on Windows platforms. First of all, I'm quite sure the code of your example does not achieve the same thing on Windows platform. You need to use the And this can then be more neatly written like |
I tried to work with
yields
Notice that one of the slashes after the In light of these observations, my opinion is to either leave the implementation as it is with |
Oh, if it is a real URL, then it is not a path (ok, that's obvious maybe ^^) My recommendation would be to still get rid of If I'm not mistaken,
would become
And
would become
This means we have a convention which says that WDYT of this? I do not mind if you consider it is too cumbersome to do, this is just a proposition to enhance the codebase, not something ultra important. |
In my implementation, I used f-strings completely and got rid of Here's a snippet of my refactoring for the
And here's where it is being used in building epub urls:
Aside from this, I think everything works completely. I have run the command for I am currently running |
Yes, I'm fine with your proposition! |
Why are you trying to use pathlib on a URL? I think urllib.parse is what you want. |
I think we all agree that using path methods on a URL is a bad idea
I'm quite pleased with the idea proposed by @elfkuzco to simply use string manipulation for that indeed (I've not yet read the PR yet, but I now think that string manipulation is the best solution for current situation). |
Pretty sure we're doing similar stuff in other scrapers. I believe we use Path (PurePosixPath would be better) on the |
I had a look at the PR and confirm I like the string manipulation proposed when manipulating URL path segments, it is way clearer than |
I had a look in sotoki and wikihow and failed to find any URL which is dynamically built by concatenating many path segments. |
There's a clue in the name that those are not the same thing 😉 We may not need it here nor in any scraper. I thought we did in wikiHow but actually we dont. |
We use both
os.path
,path.py
andpathlib
in the codebase.This is pretty confusing in many situation. We should use only one, and
pathlib
is the one to choose.The text was updated successfully, but these errors were encountered: