Remove usage of os.path and path.py, prefer pathlib #195

benoit74 · 2023-08-18T08:46:44Z

We use both os.path, path.py and pathlib in the codebase.

This is pretty confusing in many situation. We should use only one, and pathlib is the one to choose.

The text was updated successfully, but these errors were encountered:

MUCCHU · 2023-12-20T06:41:28Z

Hey, I would like to take up this issue.

rgaudin · 2023-12-20T07:50:31Z

👍

MUCCHU · 2023-12-20T13:37:56Z

I have opened a Pull request #208 to solve the issue. Kindly go through it. Thanks!

ImperaLuna · 2024-03-07T04:10:49Z

Is this issue still open? I would like to work on it.

benoit74 · 2024-03-07T04:16:05Z

Yes, still open, I assigned you the issue. I had to close previous PR because code was not working and committer vanished. You can probably get started with the code in this PR and fixed what had to be. Important is to execute on full run of the scraper to ensure it is working properly. I mentioned the command I used to test in the PR.

ImperaLuna · 2024-03-07T12:45:08Z

Should I also replace os.unlink with unlink method from Path? I've looked over 3-4 modules so far and it seems to be the only place we are still using os .

benoit74 · 2024-03-07T12:49:21Z

Yes, this would make sense.

ImperaLuna · 2024-03-07T21:23:44Z

I've wrote most of the changes however I'm having troubles testing the script.

I've used python src/gutenberg2zim and ran all the way to id 90k+ and started downloading but I ran out of local memory upsy.

However python src/gutenberg2zim --books 84 & ./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search . Take a very long time(10-20+ minutes to run into a error).

I made a clean WSL env on debian and cloned the main fork again to see how it's supposed to work normally but I also have problems there:

And at this point I'm concerned I'm not even able to properly set up de dev env...

[gutenberg2zim.constants::2024-03-07 23:10:38,587] INFO:CHECKING for dependencies on the system
[gutenberg2zim.constants::2024-03-07 23:10:38,592] INFO:PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:        rdf-files archive already exists in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:SETTING UP DATABASE
[gutenberg2zim.constants::2024-03-07 23:10:38,593] INFO:Setting up the database
[gutenberg2zim.constants::2024-03-07 23:10:38,593] DEBUG:license table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:author table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:book table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,594] DEBUG:bookformat table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] DEBUG:url table already exists.
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO:PARSING rdf-files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:38,595] INFO:        Looping throught RDF files in /mnt/d/python/test/gutenberg/rdf-files.tar.bz2
[gutenberg2zim.constants::2024-03-07 23:10:58,405] INFO:        Skipping already parsed file cache/epub/84/pg84.rdf for book id 84
[gutenberg2zim.constants::2024-03-07 23:10:58,897] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO:        Urls rsync result tmp/file_on_aleph_pglaf_org already exists, processing existing file
[gutenberg2zim.constants::2024-03-07 23:10:58,898] INFO:        Looking after relative path start in urls rsync result
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/d/python/test/gutenberg/src/gutenberg2zim/__main__.py", line 4, in <module>
    main()
  File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/entrypoint.py", line 181, in main
    setup_urls(force=force)
  File "/home/imperaluna/.local/share/hatch/env/virtual/gutenberg2zim/6b97C51F/gutenberg2zim/lib/python3.11/site-packages/gutenberg2zim/urls.py", line 256, in setup_urls
    if start_rel_path_idx == -1:  # type: ignore
       ^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'start_rel_path_idx' where it is not associated with a value

I even tried to make a docker image but it i had to ctrl+c it after about 1h

docker run my-image gutenberg2zim --books 84
[gutenberg2zim.constants::2024-03-07 21:16:20,934] INFO:Add possible url to db
[gutenberg2zim.constants::2024-03-07 21:16:20,934] DEBUG:bash -c rsync -a --list-only rsync://aleph.pglaf.org/gutenberg/ > tmp/file_on_aleph_pglaf_org
[gutenberg2zim.constants::2024-03-07 21:36:15,318] INFO:        Looking after relative path start in urls rsync result
[gutenberg2zim.constants::2024-03-07 21:36:16,023] INFO:        Removing all urls already present in DB
[gutenberg2zim.constants::2024-03-07 21:36:16,027] INFO:        Appending urls in DB from rsync result

elfkuzco · 2024-03-26T15:01:53Z

I would like to fix this @benoit74

benoit74 · 2024-03-26T17:05:19Z

@elfkuzco please do ^^

elfkuzco · 2024-04-05T07:58:38Z

Will probably take a bit longer than I expected because in some places, the code expects a string and passing an instance of Path (from pathlib) breaks things when I try to test. My idea is to add type annotations to it too rather than resolving the paths everywhere they are called.

benoit74 · 2024-04-05T10:27:09Z

Makes sense

elfkuzco · 2024-04-09T23:24:19Z

@benoit74 , I am almost done with this fix and I have tested the command python src/gutenberg2zim --books 84 and everything works fine (except that the paths displayed in the logs are fully resolved paths).

However, before I run more tests and submit a PR, I would like to point out that the usage of the os.path in the urls.py is fine and shouldn't be switched to pathlib.Path because those are actual strings (just urls) and have nothing to do with files. Using pathlib.Path here doesn't really seem "clean" as it will involve calling str on it too after building the string.

For example,

url = os.path.join("a", "b")

and

url = str(Path("a") / "b")

both achieve the same thing. But using Path seems a little bit harder to read given none of the actual Path operations are ever invoked in this module.

What do you think?

benoit74 · 2024-04-11T08:46:54Z

Great! Your point is important, thank you for asking. This is in fact even potentially a bug on Windows platforms.

First of all, I'm quite sure the code of your example does not achieve the same thing on Windows platform. You need to use the PurePosixPath class to achieve the same thing (pure because we do not intend to perform any real I/O with this path, posix because we always want / as separator even on Windows platforms).

And this can then be more neatly written like PurePosixPath("a", "b") (hence being quite close to os.path.join). You can even write stuff like PurePosixPath(*list_of_paths, "c") (instead of calling PurePosixPath multiple times like it was done for os.path.join.

elfkuzco · 2024-04-11T16:56:58Z

I tried to work with PurePosixPath and I noticed that if you call it on a url which has http:// (two consecutive slashes), it strips one off and this in turn causes unintended effects in the rest of the code. Example:

p = PurePosixPath("http://aleph.pglaf.org/cache/epub/84/pg84.epub")

yields

PurePosixPath('http:/aleph.pglaf.org/cache/epub/84/pg84.epub')

Notice that one of the slashes after the http: is gone. I think why it works with os.path is because it simply concatenates the paths "intelligently" using a path seperator. Same issue will arise even if we use Path.

In light of these observations, my opinion is to either leave the implementation as it is with os.path or default to building the strings by adding the slash operator explicitly.
The latter is my preferred approach as it is obvious to any reader that they are working with strings. Plus, there's no telling if the implementation of the os.path.join will change in future versions of the language.

benoit74 · 2024-04-12T12:44:13Z

Oh, if it is a real URL, then it is not a path (ok, that's obvious maybe ^^)

My recommendation would be to still get rid of os.path.join, this is purely misleading. I would consider to use a mix of PurePosixPath (to join the parts of url path or subpath) and simple string concatenation.

If I'm not mistaken,

base_url = os.path.join(
    os.path.join(*list(str(self.b_id))[:-1]), str(self.b_id)
)

would become

base_url = PurePosixPath(*list(str(self.b_id))[:-1]), str(self.b_id))

And

url = os.path.join(self.base, base_url)

would become

url = self.base + base_url

This means we have a convention which says that self.base (and other bases) MUST end with a slash (not a big concern AFAIK, we just have to fix BASE_THREE) and path must not start with a slash (but PurePosixPath never adds it magically)

WDYT of this? I do not mind if you consider it is too cumbersome to do, this is just a proposition to enhance the codebase, not something ultra important.

elfkuzco · 2024-04-12T13:20:02Z

In my implementation, I used f-strings completely and got rid of Path completely. And I chose the convention that paths end without a slash, so that I have to explicitly add the / myself. This, I think is more obvious to a reader of the code.

Here's a snippet of my refactoring for the build method:

        if self.base == self.BASE_ONE:
            if int(self.b_id) > 10:  # noqa: PLR2004
                components = "/".join(self.b_id[:-1])
                base_url = f"{components}/{self.b_id}"
            else:
                base_url = f"0/{self.b_id}"
            url = f"{self.base}/{base_url}"
        elif self.base == self.BASE_TWO:
            url = f"{self.base}/{self.b_id}"
        elif self.base == self.BASE_THREE:
            url = self.base
        return url

And here's where it is being used in building epub urls:

    name = "".join(["pg", b_id])
    url = f"{u.build()}/{name}.epub"
    url_images = f"{u.build()}/{name}-images.epub"
    url_noimages = f"{u.build()}/{name}-noimages.epub"
    urls.extend([url, url_images, url_noimages])

Aside from this, I think everything works completely. I have run the command for python src/gutenberg2zim --books 84 and python3 src/gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search and everything works fine. Here's a log file for the first command
beta.log

I am currently running python src/gutenberg2zim for the full run. If that works and you are okay with using f-strings as I have, I will submit the PR

benoit74 · 2024-04-12T14:57:48Z

Yes, I'm fine with your proposition!

eshellman · 2024-04-12T20:38:59Z

Why are you trying to use pathlib on a URL? I think urllib.parse is what you want.

benoit74 · 2024-04-15T06:41:09Z

I think we all agree that using path methods on a URL is a bad idea

urllib.parse.urljoin is unfortunately not helping much because AFAIK it does not automatically append "/" between URL path segments (which is all we need in fact).

I'm quite pleased with the idea proposed by @elfkuzco to simply use string manipulation for that indeed (I've not yet read the PR yet, but I now think that string manipulation is the best solution for current situation).

rgaudin · 2024-04-16T08:16:04Z

Pretty sure we're doing similar stuff in other scrapers. I believe we use Path (PurePosixPath would be better) on the path or a parsed URL. Please take a look at other scrapers (wikihow, sotoki?) as this seems pretty common behavior in scrapers

benoit74 · 2024-04-16T08:24:37Z

I had a look at the PR and confirm I like the string manipulation proposed when manipulating URL path segments, it is way clearer than PurePosixPath or Path which are not working on URLs or os.path.join which is semantically wrong. See https://github.com/openzim/gutenberg/pull/225/files#diff-1c2c0aac8f278200c33b65c8b05dc7367b1687f64c7e8794a1b9da7b1ae5e369

benoit74 · 2024-04-16T08:26:39Z

I had a look in sotoki and wikihow and failed to find any URL which is dynamically built by concatenating many path segments.

rgaudin · 2024-04-16T09:00:20Z

it is way clearer than PurePosixPath or Path which are not working on URLs

There's a clue in the name that those are not the same thing 😉
Though, a URL path (path element of an urslplit()) is as close as it gets to a PurePosixPath. It can be very handy in situation where you need to resolve targets from relative notation or ensure/compute targets relative to one another.

We may not need it here nor in any scraper. I thought we did in wikiHow but actually we dont.

benoit74 added enhancement good first issue labels Aug 18, 2023

benoit74 changed the title ~~Remove usage of os.path, prefer pathlib~~ Remove usage of os.path and path.py, prefer pathlib Aug 18, 2023

benoit74 added this to the 2.2.0 milestone Aug 18, 2023

rgaudin assigned MUCCHU Dec 20, 2023

MUCCHU mentioned this issue Dec 20, 2023

Updated the Codebase to use pathlib instead of using os.path and path.py #208

Closed

benoit74 assigned ImperaLuna and unassigned MUCCHU Mar 7, 2024

benoit74 assigned elfkuzco and unassigned ImperaLuna Mar 26, 2024

elfkuzco mentioned this issue Apr 13, 2024

Replace usage of os.path and path.py with pathlib #225

Merged

benoit74 closed this as completed in #225 Apr 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove usage of os.path and path.py, prefer pathlib #195

Remove usage of os.path and path.py, prefer pathlib #195

benoit74 commented Aug 18, 2023 •

edited

Loading

MUCCHU commented Dec 20, 2023

rgaudin commented Dec 20, 2023

MUCCHU commented Dec 20, 2023

ImperaLuna commented Mar 7, 2024

benoit74 commented Mar 7, 2024

ImperaLuna commented Mar 7, 2024

benoit74 commented Mar 7, 2024

ImperaLuna commented Mar 7, 2024 •

edited

Loading

elfkuzco commented Mar 26, 2024

benoit74 commented Mar 26, 2024

elfkuzco commented Apr 5, 2024

benoit74 commented Apr 5, 2024

elfkuzco commented Apr 9, 2024

benoit74 commented Apr 11, 2024

elfkuzco commented Apr 11, 2024

benoit74 commented Apr 12, 2024

elfkuzco commented Apr 12, 2024

benoit74 commented Apr 12, 2024

eshellman commented Apr 12, 2024

benoit74 commented Apr 15, 2024

rgaudin commented Apr 16, 2024

benoit74 commented Apr 16, 2024

benoit74 commented Apr 16, 2024

rgaudin commented Apr 16, 2024

Remove usage of os.path and path.py, prefer pathlib #195

Remove usage of os.path and path.py, prefer pathlib #195

Comments

benoit74 commented Aug 18, 2023 • edited Loading

MUCCHU commented Dec 20, 2023

rgaudin commented Dec 20, 2023

MUCCHU commented Dec 20, 2023

ImperaLuna commented Mar 7, 2024

benoit74 commented Mar 7, 2024

ImperaLuna commented Mar 7, 2024

benoit74 commented Mar 7, 2024

ImperaLuna commented Mar 7, 2024 • edited Loading

elfkuzco commented Mar 26, 2024

benoit74 commented Mar 26, 2024

elfkuzco commented Apr 5, 2024

benoit74 commented Apr 5, 2024

elfkuzco commented Apr 9, 2024

benoit74 commented Apr 11, 2024

elfkuzco commented Apr 11, 2024

benoit74 commented Apr 12, 2024

elfkuzco commented Apr 12, 2024

benoit74 commented Apr 12, 2024

eshellman commented Apr 12, 2024

benoit74 commented Apr 15, 2024

rgaudin commented Apr 16, 2024

benoit74 commented Apr 16, 2024

benoit74 commented Apr 16, 2024

rgaudin commented Apr 16, 2024

benoit74 commented Aug 18, 2023 •

edited

Loading

ImperaLuna commented Mar 7, 2024 •

edited

Loading