wikiteam3 v4 release #176

Closed
yzqzss wants to merge 75 commits

@yzqzss yzqzss commented Aug 27, 2023

Fixes

  • fix: --xmlapiexport infinite loop on page with missing revision
  • fix: config is not passed to log_error()
  • fix: parentid should be a positive integer
  • fix: truncated API response for "allrevisions" causes infinite loop (#166)
  • fix: yield bug in red_titles()
  • fix: Unable to traverse Special:Filelist
  • fix: can't find API URL from HTML (new MW versions)
  • ...

feats

  • feat: ensure params["limit"] is int and >= 2
  • feat: save timestamp metadata to images.txt
  • feat: cli: --disable-image-verify (Some images get dropped from the dump #170)
  • feat: cli: add --image-timestamp-interval to download images uploaded in the given time interval (Add option to backup recent images only (with a custom cut-off date) #151)
  • feat: cli: add --user-agent to set a custom UA
  • feat: --ia-wbm-booster Incremental image dump powered by web.archive.org (experimental)
  • feat: write .mark file after the dump completes
  • feat: image: dynamic load config.delay
  • feat: If there are archives on IA in the last year, exit
  • feat: a trick to get original file for fandom.com
  • feat: more robust conversion of URLs to filename prefixes
  • ...

refactor

countless
...

drop legacy code

  • gui.py
  • wikia.py
  • launcher.py
  • ...

Special:Export

  • Do not strip sha1 tags from XML

Breaking changes

  • image: change the magic NULL string from False to null
  • drop .desc xmlfile support
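
A reader that has to handle image lists written both before and after the NULL-string change could treat the two markers as equivalent. The sketch below is only an illustration: `parse_optional_field` and `NULL_MARKERS` are hypothetical names, and which fields actually carry the marker is not stated in this PR.

```python
from typing import Optional

# "null" is the new marker described above; "False" is the legacy one.
NULL_MARKERS = {"null", "False"}


def parse_optional_field(field: str) -> Optional[str]:
    """Return None for either NULL marker, otherwise the field unchanged."""
    return None if field in NULL_MARKERS else field
```

One consequence of this approach is that a genuine value spelled exactly `False` or `null` cannot be told apart from a missing one.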

Drop launcher

Shifts compression responsibilities from the launcher to the uploader.

dependencies

  • remove: all useless dependencies
  • add: python-slugify
  • add (uploader): zstd

refactor uploader

  • Use zstd to compress .xml for better compression speed, compression ratio, decompression speed, and streaming decompression (with -{19..22} --ultra --long=31).
  • Instead of packaging the whole wikidump into wikidump&history.7z, we upload each file separately for better parsing and management in the future.
  • By default, the compression tasks of each process are queued (via a socket lock) to avoid overload/OOM. Pass --parallel to disable the queue.
  • Upload logo. (optional)
  • Stop converting language codes to human readable formats.
  • IA ITEM metadata: add 'upload-state' and 'scanner'.
  • ...
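
As a rough illustration of the zstd options listed above, the following sketch builds the command line for one file. `zstd_command` is a hypothetical helper, not the uploader's actual code, and the `.zst` output name is an assumption.

```python
from pathlib import Path
from typing import List


def zstd_command(src: Path, level: int = 19) -> List[str]:
    """Build a zstd invocation using the flags quoted in the PR description."""
    if not 19 <= level <= 22:
        raise ValueError("level is expected to be in the range 19..22")
    return [
        "zstd",
        f"-{level}",
        "--ultra",      # required by zstd for levels above 19
        "--long=31",    # long-distance matching with a 2 GiB window
        str(src),
        "-o",
        f"{src}.zst",   # assumed output name
    ]
```

The list can be passed to `subprocess.run(..., check=True)`. Files compressed with `--long=31` generally need `--long=31` on decompression as well, because the window exceeds zstd's default decoder limit.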

https://pypi.org/project/wikiteam3/

yzqzss added 30 commits August 3, 2023 21:35
strip and unquote url
Convert domain to IDNA
Remove port `:80` and `:443` if `http://` and `https://` respectively
remove last slash if exists
truncate to last slash
remove "/any.php" suffix (`r"(/[^/]+\.php)"`)
remove ~ tilde
slugify the url path if `ascii_slugify` is True
replace port(`:`) with underscore(`_`)
lower case
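
The steps in this commit message can be sketched roughly as follows. This is one possible reading, not the PR's implementation: `url_to_prefix` is a hypothetical name, a regular expression stands in for python-slugify, and "truncate to last slash" is not implemented as a separate step because its exact semantics are not spelled out here.

```python
import re
from urllib.parse import unquote, urlsplit


def url_to_prefix(url: str, ascii_slugify: bool = True) -> str:
    """Turn a wiki URL into a filename-safe prefix (illustrative only)."""
    # Strip whitespace and undo percent-encoding.
    parts = urlsplit(unquote(url.strip()))

    # Convert the domain to its IDNA (punycode) form.
    host = (parts.hostname or "").encode("idna").decode("ascii")

    # Drop the port when it is the default for the scheme.
    port = parts.port
    if (parts.scheme, port) in (("http", 80), ("https", 443)):
        port = None

    # Remove trailing slashes, then a trailing "/something.php".
    path = parts.path.rstrip("/")
    path = re.sub(r"/[^/]+\.php$", "", path)

    # Remove tildes.
    path = path.replace("~", "")

    if ascii_slugify:
        # Regex stand-in for python-slugify: keep ASCII letters, digits, "-", "_".
        path = re.sub(r"[^A-Za-z0-9_-]+", "_", path).strip("_")
    else:
        path = path.strip("/").replace("/", "_")

    # Join host, port and path with underscores, then lower-case.
    pieces = [host] + ([str(port)] if port else []) + ([path] if path else [])
    return "_".join(pieces).lower()
```

For example, `url_to_prefix("https://Example.com:443/w/index.php")` gives `example.com_w` under these assumptions.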
pre-bump 4.0.0
WIP: deprecate launcher
WIP: refactor uploader
WIP: rename function names from camelCase to snake_case
WIP: default args
WIP: feat: `--noverify-image-size`
WIP: enable `--xmlrevisions --curonly` support
TODO: feat: save `Special:Log` wiki page
breaking change: image: change the magic NULL string from `False` to `null`
chore: remove deprecated alias for xml.etree.ElementTree
TODO: feat: cli: add an option to disable random UserAgent
TODO: feat: Incremental image dump powered by archive.org (experimental)
WIP: feat: add option to dump recent images only (with a custom cut-off date)
Reasons:
* If the limit is a float between 0 and 1, the MW backend will force-int it to 0.
* To loop over all the revisions, we need to retrieve at least 2 at a time.
* If the historical revisions cannot be retrieved even with limit=2, it should fail at this point so that the user can try `--curonly`.
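
A minimal sketch of the rule these reasons describe, assuming the limit is halved after a failed request (the halving strategy and the names `MIN_LIMIT` and `next_limit` are assumptions, not taken from the PR):

```python
MIN_LIMIT = 2  # at least 2 revisions per request are needed to make progress


def next_limit(current) -> int:
    """Return a smaller integer limit after a failed request, never below 2."""
    current = int(current)  # a float such as 0.5 becomes 0 here
    if current <= MIN_LIMIT:
        # Nothing smaller is usable, so give up instead of looping forever.
        raise RuntimeError("request failed even with limit=2; try --curonly")
    return max(MIN_LIMIT, current // 2)
```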
<element name="parentid" type="positiveInteger" minOccurs="0" maxOccurs="1"/>
The last titlelist may not be yielded
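
The kind of bug described here commonly appears when a generator yields fixed-size batches and forgets the final, partial one. The following is a generic sketch of that pattern with a hypothetical `batched_titles` function; it is not the PR's actual code.

```python
from typing import Iterable, Iterator, List


def batched_titles(titles: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield titles in lists of at most `size` items."""
    batch: List[str] = []
    for title in titles:
        batch.append(title)
        if len(batch) == size:
            yield batch
            batch = []
    # Without this final check, a trailing partial batch is silently dropped.
    if batch:
        yield batch
```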
deprecate: `.desc` xmlfile support
feat: cli: `--disable-image-verify`
pre-bump: version 4.0.0
feat: cli: add `--user-agent` to set a custom UA
feat: `--ia-wbm-booster`
feat: write .mark file after dumped
refactor: SessionMonkeyPatch
feat: image: dynamic load config.delay
@yzqzss yzqzss marked this pull request as ready for review August 27, 2023 01:31
@yzqzss
Author

yzqzss commented Aug 27, 2023

Point of Conflict.

eeab93a

@randomnetcat
Collaborator

Several things:

  • This is a truly massive PR. Could you consider splitting it up into smaller ones? I wouldn't even know where to start reviewing this.
  • Just glancing at the changes, you use the "saveweb" project in URLs several times, which is not this project.
  • From reading the description, it sounds like you're breaking format compatibility with upstream wikiteam and older versions of this project. That's not a good idea, since people are depending on that (at least I am in some of my code).

@elsiehupp
Member

This is the sort of thing where I will defer to everyone else.

In general I would say that anything breaking backwards compatibility should be dependent on implementation of build-versioning that would allow users to reliably target older versions.

Right now this repository does not even have version-tagged GitHub builds, let alone versioned PyPI builds, so at this point breaking backwards compatibility is a no-go.

As for drastically refactoring the code… that's fine, and probably for the better, as long as there is build-versioning in place to protect existing users.

@elsiehupp
Member

Regarding format compatibility: I think introducing a new default format is fine as long as the existing upstream format continues to be supported alongside it for a substantial "bridge" period, with (a) an ability to convert existing dumps to the new format, and (b) strong "deprecation" nudges encouraging users to migrate.

Refactoring (and abstracting much of the backend) could of course facilitate this, hence why I'm supportive of refactoring more generally.

Basically what I'm saying is that introducing a new data format should be dependent on first establishing a stable public API for the backend, which currently does not exist.

@robkam
Member

robkam commented Aug 27, 2023

This is too complex for my level of comprehension - except that this PR for wikiteam3 is not in that repository.

@yzqzss yzqzss closed this Aug 29, 2023
@elsiehupp
Member

Here's my thought…

@yzqzss why don't you open a new Pull Request from an earlier commit on this branch? (I think you have to create a branch from that commit in order to do so.)

This would be much easier to approach if it wasn't a gigantic total rewrite all at once, and breaking it into chunks this way would help.

If you're not interested in doing this, to be fair, we could try to do so ourselves, but you're more familiar with your own code than we are.
