Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipwb index outputs unrelated information into CDXJ file #639

Closed
anatoly-scherbakov opened this issue Apr 26, 2020 · 19 comments
Closed

ipwb index outputs unrelated information into CDXJ file #639

anatoly-scherbakov opened this issue Apr 26, 2020 · 19 comments
Labels

Comments

@anatoly-scherbakov
Copy link
Contributor

I am creating a CDXJ file from a WARC archive in a bash script, like this:

ipwb index "$warc_file.warc.gz" > "$warc_file.cdxj"

I subsequently put the file into IPFS, where it can be easily looked up, and an issue can be demonstrated:

$ ipfs cat QmZup5pBAJoXv49eszWZ3miJ9VMPnsKF3Ez2qRHXZ2pzoC | head -n 4
This version of ipwb is outdated. Please run pip install --upgrade ipwb.
* Latest version: 0.2020.4.24.1847
* Installed version: 0.2019.8.15.213
!context ["http://tools.ietf.org/html/rfc7089"]

First three lines should not be printed: due to them, my CDXJ file is now invalid. I suggest to use logger.warning for these messages instead of using print(). This way, the messages will be sent to stderr rather than to stdout. I aim to prepare a PR for that purpose shortly.

@ibnesayeed
Copy link
Member

Yes, this is a bug. The util.py file has a function defiled as logError() for this purpose, which should be used instead of print().

@ibnesayeed ibnesayeed added the bug label Apr 26, 2020
@anatoly-scherbakov
Copy link
Contributor Author

  1. I suggest to use Python's built-in logging instead a wrapper function around print(). There are plugins for flake8 which consider print() an antipattern.

  2. I would actually like to remove the call of util.check_pypi_for_update() from def main(). This hurts performance; it slows down tests; if the index operation is performed on really many WARC files - there is a potential to DoS pypi.org. Maybe this check can be performed by a CLI subcommand, something like ipywb check-for-update?

@machawk1
Copy link
Member

I am unable to replicate this behavior and want to nail down the why so we can also fix it in other places like you have provided in #640, @anatoly-scherbakov.

Perhaps it is the WARC itself but maybe it is the OS. I am using macOS 10.15.4 Python 3.8.0 and ipwb 0.2019.08.15.0213 from pypi (reinstalled previous version for testing).

% ipwb --version
InterPlanetary Wayback 0.2019.08.15.0213
% ipwb index samples/warcs/5mementos.warc > warc_file.cdxj
Processing WARC records in 5mementos.warc complete
% ipfs add warc_file.cdxj
added QmboaMdp1VJGqUfhYjva3JrQz3QbsJft4RVQ42sEp4tE3F warc_file.cdxj
% ipfs cat QmboaMdp1VJGqUfhYjva3JrQz3QbsJft4RVQ42sEp4tE3F
!context ["http://tools.ietf.org/html/rfc7089"]
!meta {"generator": "InterPlanetary Wayback v.0.2019.08.15.0213", "created_at": "2020-04-27T11:12:51.987909"}
us,anothersite)/ 20161231110000
...

I agree that the first three lines should not be printed. I am curious as to why my freshly pulled copy did not provide the outdated version error.

I also agree that we should use logging instead of print() and instill that into the flake8 checks.

I would like the update command to be run automatically, but perhaps we should incorporate some smarts into it. Software that requires manually invocation for updating rarely gets updated. We were retaining pseudo-preferences for ipwb within the config file in ~/.ipfs, which might serve as a place for storing last check time and short-circuiting but I think overall, @ibnesayeed was against retaining any state like this between executions.

For the crux of this issue, @anatoly-scherbakov can you verify if this is replicable when using the latest version of ipwb we released last week (v0.2020.4.24.1847)? Regardless of whether it is (i.e., it will rear its head again on the next update), I would like to identify why this is not replicable between our two instances of the same build version.

@ibnesayeed
Copy link
Member

I suggest to use Python's built-in logging instead a wrapper function around print(). There are plugins for flake8 which consider print() an antipattern.

Using loggers is certainly a good practice and I am in support for it. However, I suggested to reuse logError() function in the previous reply to ensure consistency. I was thinking that a separate PR would be better to change every potential logging instance. These print statements are yet another trail from the hackathon days where things were put together in a hurry.

I would actually like to remove the call of util.check_pypi_for_update() from def main(). This hurts performance; it slows down tests; if the index operation is performed on really many WARC files - there is a potential to DoS pypi.org. Maybe this check can be performed by a CLI subcommand, something like ipywb check-for-update?

I personally do not like built-in check for update feature in every app and almost hate per application auto update feature. This sounds so non-Unixy to me. Software distribution, update check/notification, and update installation are responsibilities of app repositories and update managers. In early days of Windows and Mac, individual applications used to have their own distribution strategies so it made more sense in those applications. I think @machawk1 likes this feature. I remember talking to him about moving this functionality in a separate thread or something so it does not block the main execution thread, but then the notification might be lost in other messages and not be as noticeable. Unintentional DoS is a good point when the script is invoked multiple times, thought indexing does not have to be executed once for each WARC file, it should accept multiple files, wildcard, or a directory path containing WARC (it should be implemented, if it does not do so already).

@machawk1
Copy link
Member

I think @machawk1 likes this feature.

Yep, it is what I am used to: the option to know, on launch, whether you are using the latest version of the software. Many Mac apps continue to do this. Those that use a package manager, e.g., homebrew or the macOS app store, rely on said manager to let the system know that software may be out-of-date. I do not think the "manager" we are anticipating to be the means of installation does this auto-check like homebrew does when installing any other package.

Ultimately, it is not critically important to me and the automatic check on each run is invasive and could be strategically culled.

thought (sic) indexing does not have to be executed once for each WARC file

Right, it should only be called once on the initial invocation and not while iterating through each file...unless multiple instances are running as the usage pattern instead of feeding all WARCs to a single instance of ipwb.

@ibnesayeed
Copy link
Member

@machawk1: I am unable to replicate this behavior and want to nail down the why...

I was able to replicate it:

[localhost] $ docker container run --rm -it python:3 bash
[container] # pip install ipwb==0.2019.8.15.213
[container] # ipwb --version
This version of ipwb is outdated. Please run pip install --upgrade ipwb.
* Latest version: 0.2020.4.24.1847
* Installed version: 0.2019.8.15.213
InterPlanetary Wayback 0.2019.08.15.0213

@ibnesayeed
Copy link
Member

Ultimately, it is not critically important to me and the automatic check on each run is invasive and could be strategically culled.

One way to minimize the footprint of automatic check would be to only call it when a version check is performed (i.e., ipwb --version) and if we want to be a little more aggressive then it can be invoked AFTER top level help message print so that someone typing ipwb --help gets to see it (but not when running ipwb index/replay --help). Let's just not invoke it on regular replay and index commands.

@machawk1
Copy link
Member

@ibnesayeed I see here that you can replicate the version being reported but can you replicate it being included in a CDXJ?

@machawk1
Copy link
Member

only call it when a version check is performed

This is a good approach. For discussion (though I agree with your method), why not have it also be run with the index/replay help as well as the top level?

@ibnesayeed
Copy link
Member

I see here that you can replicate the version being reported but can you replicate it being included in a CDXJ?

Yes, bellow the fold I demonstrate it in the same container I illustrated the issue above:

[container] # wget https://github.com/oduwsdl/ipwb/raw/master/samples/warcs/salam-home.warc
[container] # ipwb index salam-home.warc > salam-home.cdxj
Daemon is not running at /dns/localhost/tcp/5001/http
[container] # cat salam-home.cdxj 
This version of ipwb is outdated. Please run pip install --upgrade ipwb.
* Latest version: 0.2020.4.24.1847
* Installed version: 0.2019.8.15.213

Now, installing and running IPFS daemon and then trying to index again:

[container] # export IPFS_VERSION=v0.4.23
[container] # wget -q https://dist.ipfs.io/go-ipfs/${IPFS_VERSION}/go-ipfs_${IPFS_VERSION}_linux-amd64.tar.gz && tar xvfz go-ipfs*.tar.gz && mv go-ipfs/ipfs /usr/local/bin/ipfs
# ipfs daemon --init &
# ipwb index salam-home.warc > salam-home.cdxj
Processing WARC records in salam-home.warc complete
# cat salam-home.cdxj
This version of ipwb is outdated. Please run pip install --upgrade ipwb.
* Latest version: 0.2020.4.24.1847
* Installed version: 0.2019.8.15.213
!context ["http://tools.ietf.org/html/rfc7089"]
!meta {"generator": "InterPlanetary Wayback v.0.2019.08.15.0213", "created_at": "2020-04-27T16:37:49.042863"}
edu,odu,cs)/~salam/ 20160305192247 {"locator": "urn:ipfs/QmUmhcmHu8SmCut6tJn6e6iM3Cio8U8t5dBgpeWk29GBqZ/QmXq5hNRxDBRpB8Cq2KsysBbWsfmjaWr7DZ4wM}

@anatoly-scherbakov
Copy link
Contributor Author

Using loggers is certainly a good practice and I am in support for it. However, I suggested to reuse logError() function in the previous reply to ensure consistency. I was thinking that a separate PR would be better to change every potential logging instance.

Yes, that's a point. But, my practice suggests such a PR usually never happens :)

--version check is showing a version, and that's what a normal Unix program does. It does not call the network. Imagine some day someone writes a shell script using ipwb and check for its version; they will not expect this check takes more than a millisecond because they do not expect a network call.

If @machawk1 likes this feature - maybe we could revisit the option of creating a very special subcommand for it, the name of which explicitly and unambiguously explains what it does?

@anatoly-scherbakov
Copy link
Contributor Author

I also refactored the code into check_for_update.py file - this is a temporary measure; once the number of these specialized files grows we could group them into packages like services or utils. For now I'd suggest to keep them top level, until the proper structure becomes apparent.

@ibnesayeed
Copy link
Member

--version check is showing a version, and that's what a normal Unix program does. It does not call the network. Imagine some day someone writes a shell script using ipwb and check for its version; they will not expect this check takes more than a millisecond because they do not expect a network call.

This is a very good point.

@machawk1
Copy link
Member

I completely agree. --version should not hit the network.

maybe we could revisit the option of creating a very special subcommand for it

This is sort-of what go-ipfs does to the extreme of ipfs-update being a separate package. If there are severe performance ramifications and problematic calls to pypi, perhaps we should disable this feature for now and allow users to manually update via pip or pulling a fresh source. That's intuitive enough for most and does not put the responsibility of the software being updated on the software itself.

@anatoly-scherbakov
Copy link
Contributor Author

In that case, does everyone agree to completely remove the code related to this version check? If ever needed, it can always be retrieved from Git history.

@anatoly-scherbakov
Copy link
Contributor Author

One more thing I could add is that not everyone actually uses pip, which means the message is almost always relevant but not in every case. I am using poetry for my newer projects, where you don't need to edit requirements yourself; you just say

poetry add ipwb

or

poetry update ipwb

and everything is done for you. Let's delegate this task to the systems which are built specifically for the purpose.

@machawk1
Copy link
Member

I agree and interesting point on poetry, I am unfamiliar with it. Feel free to send a PR removing the version checks, @anatoly-scherbakov.

@machawk1
Copy link
Member

#366 is related regarding the usage of other package managers and dependency specification (e.g., Pipfile).

@ibnesayeed
Copy link
Member

ibnesayeed commented Apr 27, 2020

I am usually very happy when more lines of code are removed from a repository than added. ;-)

Again, I am generally not in favor of implementing check for update feature in every tool, which has a lot of coupling and dictates one way of doing things that may not align with individual users' preferences and workflows. This may be a cool feature for desktop applications with UIs, but server apps or packages do not need such things in place. Often these services are run using bootstrapping scripts and there are no eyes to look at what was logged at the time of application boot. Auto updates (that were not part of IPWB in the first place) are even more dangerous when services are run in containers and state of those container images should be frozen, especially when big deployments require signing of application state by their DevOp and QA chain.

That said, if one is using pip to install and manage IPWB, they can do something like:

$ pip list --outdated
Package Version         Latest           Type 
------- --------------- ---------------- -----
Flask   1.1.1           1.1.2            wheel
ipwb    0.2019.8.15.213 0.2020.4.24.1847 wheel
six     1.11.0          1.14.0           wheel

Or

$ pip search ipwb
ipwb (0.2020.4.24.1847)  - InterPlanetary Wayback (ipwb): Web Archive integration with IPFS
  INSTALLED: 0.2019.8.15.213
  LATEST:    0.2020.4.24.1847

So, let's get rid of the built-in check for update feature and optionally document update commands in the README.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants