Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove usage of os.path and path.py, prefer pathlib #195

Closed
benoit74 opened this issue Aug 18, 2023 · 24 comments · Fixed by #225
Closed

Remove usage of os.path and path.py, prefer pathlib #195

benoit74 opened this issue Aug 18, 2023 · 24 comments · Fixed by #225

Comments

@benoit74
Copy link
Collaborator

benoit74 commented Aug 18, 2023

We use both os.path, path.py and pathlib in the codebase.

This is pretty confusing in many situation. We should use only one, and pathlib is the one to choose.

@benoit74 benoit74 changed the title Remove usage of os.path, prefer pathlib Remove usage of os.path and path.py, prefer pathlib Aug 18, 2023
@benoit74 benoit74 added this to the 2.2.0 milestone Aug 18, 2023
@MUCCHU
Copy link

MUCCHU commented Dec 20, 2023

Hey, I would like to take up this issue.

@rgaudin
Copy link
Member

rgaudin commented Dec 20, 2023

👍

@MUCCHU
Copy link

MUCCHU commented Dec 20, 2023

I have opened a Pull request #208 to solve the issue. Kindly go through it. Thanks!

@ImperaLuna
Copy link

Is this issue still open? I would like to work on it.

@benoit74 benoit74 assigned ImperaLuna and unassigned MUCCHU Mar 7, 2024
@benoit74
Copy link
Collaborator Author

benoit74 commented Mar 7, 2024

Yes, still open, I assigned you the issue. I had to close previous PR because code was not working and committer vanished. You can probably get started with the code in this PR and fixed what had to be. Important is to execute on full run of the scraper to ensure it is working properly. I mentioned the command I used to test in the PR.

@ImperaLuna
Copy link

Should I also replace os.unlink with unlink method from Path? I've looked over 3-4 modules so far and it seems to be the only place we are still using os .

@benoit74
Copy link
Collaborator Author

benoit74 commented Mar 7, 2024

Yes, this would make sense.

@ImperaLuna
Copy link

ImperaLuna commented Mar 7, 2024

I've wrote most of the changes however I'm having troubles testing the script.

I've used python src/gutenberg2zim and ran all the way to id 90k+ and started downloading but I ran out of local memory upsy.

However python src/gutenberg2zim --books 84 & ./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search . Take a very long time(10-20+ minutes to run into a error).

I made a clean WSL env on debian and cloned the main fork again to see how it's supposed to work normally but I also have problems there:

And at this point I'm concerned I'm not even able to properly set up de dev env...

[gutenberg2zim.constants::2024-03-07 23:10:38,587] INFO:CHECKING for dependencies on the system
[gutenberg2zim.constants::2024-03-07 23:10:38,592] INFO:PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:        rdf-files archive already exists in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:SETTING UP DATABASE
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:Setting up the database
[gutenberg2zim.constants::2024-03-07 23:10:38,593] DEBUG:license table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:author table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:book table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:bookformat table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] DEBUG:url table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO:PARSING rdf-files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO:        Looping throught RDF files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:58,405] INFO:        Skipping already parsed file cache/epub/84/pg84.rdf for book id 84
[gutenberg2zim.constants::2024-03-07 23:10:58,897] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO:        Urls rsync result tmp/file_on_aleph_pglaf_org already exists, processing existing file
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO:        Looking after relative path start in urls rsync result
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/d/python/test/gutenberg/src/gutenberg2zim/__main__.py", line 4, in <module>
    main()
  File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/entrypoint.py", line 181, in main
    setup_urls(force=force)
  File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/urls.py", line 256, in setup_urls
    if start_rel_path_idx == -1:  # type: ignore
       ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'start_rel_path_idx' where it is not associated with a value

I even tried to make a docker image but it i had to ctrl+c it after about 1h

docker run my-image gutenberg2zim --books 84
[gutenberg2zim.constants::2024-03-07 21:16:20,934] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 21:16:20,934] DEBUG:bash -c rsync -a --list-only rsync://aleph.pglaf.org/gutenberg/ > tmp/file_on_aleph_pglaf_org
[gutenberg2zim.constants::2024-03-07 21:36:15,318] INFO:        Looking after relative path start in urls rsync result
[gutenberg2zim.constants::2024-03-07 21:36:16,023] INFO:        Removing all urls already present in DB
[gutenberg2zim.constants::2024-03-07 21:36:16,027] INFO:        Appending urls in DB from rsync result

@elfkuzco
Copy link
Contributor

I would like to fix this @benoit74

@benoit74
Copy link
Collaborator Author

@elfkuzco please do ^^

@benoit74 benoit74 assigned elfkuzco and unassigned ImperaLuna Mar 26, 2024
@elfkuzco
Copy link
Contributor

elfkuzco commented Apr 5, 2024

Will probably take a bit longer than I expected because in some places, the code expects a string and passing an instance of Path (from pathlib) breaks things when I try to test. My idea is to add type annotations to it too rather than resolving the paths everywhere they are called.

@benoit74
Copy link
Collaborator Author

benoit74 commented Apr 5, 2024

Makes sense

@elfkuzco
Copy link
Contributor

elfkuzco commented Apr 9, 2024

@benoit74 , I am almost done with this fix and I have tested the command python src/gutenberg2zim --books 84 and everything works fine (except that the paths displayed in the logs are fully resolved paths).

However, before I run more tests and submit a PR, I would like to point out that the usage of the os.path in the urls.py is fine and shouldn't be switched to pathlib.Path because those are actual strings (just urls) and have nothing to do with files. Using pathlib.Path here doesn't really seem "clean" as it will involve calling str on it too after building the string.

For example,

url = os.path.join("a", "b")

and

url = str(Path("a") / "b")

both achieve the same thing. But using Path seems a little bit harder to read given none of the actual Path operations are ever invoked in this module.

What do you think?

@benoit74
Copy link
Collaborator Author

Great! Your point is important, thank you for asking. This is in fact even potentially a bug on Windows platforms.

First of all, I'm quite sure the code of your example does not achieve the same thing on Windows platform. You need to use the PurePosixPath class to achieve the same thing (pure because we do not intend to perform any real I/O with this path, posix because we always want / as separator even on Windows platforms).

And this can then be more neatly written like PurePosixPath("a", "b") (hence being quite close to os.path.join). You can even write stuff like PurePosixPath(*list_of_paths, "c") (instead of calling PurePosixPath multiple times like it was done for os.path.join.

@elfkuzco
Copy link
Contributor

I tried to work with PurePosixPath and I noticed that if you call it on a url which has http:// (two consecutive slashes), it strips one off and this in turn causes unintended effects in the rest of the code. Example:

p = PurePosixPath("http://aleph.pglaf.org/cache/epub/84/pg84.epub")

yields

PurePosixPath('http:/aleph.pglaf.org/cache/epub/84/pg84.epub')

Notice that one of the slashes after the http: is gone. I think why it works with os.path is because it simply concatenates the paths "intelligently" using a path seperator. Same issue will arise even if we use Path.

In light of these observations, my opinion is to either leave the implementation as it is with os.path or default to building the strings by adding the slash operator explicitly.
The latter is my preferred approach as it is obvious to any reader that they are working with strings. Plus, there's no telling if the implementation of the os.path.join will change in future versions of the language.

@benoit74
Copy link
Collaborator Author

Oh, if it is a real URL, then it is not a path (ok, that's obvious maybe ^^)

My recommendation would be to still get rid of os.path.join, this is purely misleading. I would consider to use a mix of PurePosixPath (to join the parts of url path or subpath) and simple string concatenation.

If I'm not mistaken,

base_url = os.path.join(
    os.path.join(*list(str(self.b_id))[:-1]), str(self.b_id)
)

would become

base_url = PurePosixPath(*list(str(self.b_id))[:-1]), str(self.b_id))

And

url = os.path.join(self.base, base_url)

would become

url = self.base + base_url

This means we have a convention which says that self.base (and other bases) MUST end with a slash (not a big concern AFAIK, we just have to fix BASE_THREE) and path must not start with a slash (but PurePosixPath never adds it magically)

WDYT of this? I do not mind if you consider it is too cumbersome to do, this is just a proposition to enhance the codebase, not something ultra important.

@elfkuzco
Copy link
Contributor

In my implementation, I used f-strings completely and got rid of Path completely. And I chose the convention that paths end without a slash, so that I have to explicitly add the / myself. This, I think is more obvious to a reader of the code.

Here's a snippet of my refactoring for the build method:

        if self.base == self.BASE_ONE:
            if int(self.b_id) > 10:  # noqa: PLR2004
                components = "/".join(self.b_id[:-1])
                base_url = f"{components}/{self.b_id}"
            else:
                base_url = f"0/{self.b_id}"
            url = f"{self.base}/{base_url}"
        elif self.base == self.BASE_TWO:
            url = f"{self.base}/{self.b_id}"
        elif self.base == self.BASE_THREE:
            url = self.base
        return url

And here's where it is being used in building epub urls:

    name = "".join(["pg", b_id])
    url = f"{u.build()}/{name}.epub"
    url_images = f"{u.build()}/{name}-images.epub"
    url_noimages = f"{u.build()}/{name}-noimages.epub"
    urls.extend([url, url_images, url_noimages])

Aside from this, I think everything works completely. I have run the command for python src/gutenberg2zim --books 84 and python3 src/gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search and everything works fine. Here's a log file for the first command
beta.log

I am currently running python src/gutenberg2zim for the full run. If that works and you are okay with using f-strings as I have, I will submit the PR

@benoit74
Copy link
Collaborator Author

Yes, I'm fine with your proposition!

@eshellman
Copy link
Collaborator

Why are you trying to use pathlib on a URL? I think urllib.parse is what you want.

@benoit74
Copy link
Collaborator Author

I think we all agree that using path methods on a URL is a bad idea

urllib.parse.urljoin is unfortunately not helping much because AFAIK it does not automatically append "/" between URL path segments (which is all we need in fact).

I'm quite pleased with the idea proposed by @elfkuzco to simply use string manipulation for that indeed (I've not yet read the PR yet, but I now think that string manipulation is the best solution for current situation).

@rgaudin
Copy link
Member

rgaudin commented Apr 16, 2024

Pretty sure we're doing similar stuff in other scrapers. I believe we use Path (PurePosixPath would be better) on the path or a parsed URL. Please take a look at other scrapers (wikihow, sotoki?) as this seems pretty common behavior in scrapers

@benoit74
Copy link
Collaborator Author

I had a look at the PR and confirm I like the string manipulation proposed when manipulating URL path segments, it is way clearer than PurePosixPath or Path which are not working on URLs or os.path.join which is semantically wrong. See https://github.com/openzim/gutenberg/pull/225/files#diff-1c2c0aac8f278200c33b65c8b05dc7367b1687f64c7e8794a1b9da7b1ae5e369

@benoit74
Copy link
Collaborator Author

I had a look in sotoki and wikihow and failed to find any URL which is dynamically built by concatenating many path segments.

@rgaudin
Copy link
Member

rgaudin commented Apr 16, 2024

it is way clearer than PurePosixPath or Path which are not working on URLs

There's a clue in the name that those are not the same thing 😉
Though, a URL path (path element of an urslplit()) is as close as it gets to a PurePosixPath. It can be very handy in situation where you need to resolve targets from relative notation or ensure/compute targets relative to one another.

We may not need it here nor in any scraper. I thought we did in wikiHow but actually we dont.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment