-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect image displayed #284
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions. |
I think this is caused by the way image files are named. File names are currently made from the Adler32 checksum of the image url calculated here and this is liable to have lots of collisions since the urls are mainly in the format "https://i.stack.imgur.com/XXXXX.jpg" and Adler32 doesn't distribute checksums very evenly, especially with small inputs. Just doing a quick test generating random stack imgur urls I found that out of 1,000 URLs about 10 would collide, and out of 10,000 URLs almost 1,000 would collide, which fits with how often I encounter obviously wrong images on larger offline SE sites. I think it makes sense to switch to any common hash algorithm like MD5 or SHA1/2 since they're better suited to this kind of use and aren't much slower. Even doing a one-line change and switching to zlib's crc32 would massively reduce the problem (down to about 40 collisions per million), but given the minimal performance cost I think the above solution is better. Here's a quick demo of the issue:
|
I would suggest to use @rgaudin @kelson42 : does it makes sense for you as well, since you have more experience on this kind of topics? @sam-masaki : thank you a lot for the investigation and the valuable quick demo (even if I would have been convinced without it, it is way better with the demo 😉 ) ; would like to to propose a PR (no pressure, just a proposition)? |
Thanks, I just wanted to be thorough since I'm not familiar with the codebase (and I wasn't familiar with adler32). I can put in a PR soon just not right now since it's late for me. |
No hurry, the issue is open since March and we will probably not run again |
Moreover, one could concatenate the filesize of the image with the checksum to further reduce collisions: filename = f"{checksum}_{filesize}" |
Thank you very much. It's not noted anywhere but I believe We can give md5 a shot. We'd jump from 10chars to 32chars. |
Something like this should help developing a good "create_hash" function. |
You're welcome to do it. You can short-circuit all actual processing, downloads and ZIM creation as long as you render the HTML so the Image downloader receives URLs. |
Originally reported by @mazen-mardini in #283 (comment) we can see in this mathoverflow question that the image used in the bottom of the question is different from the one online at the time of the Dump (December 2022). Screenshots in #283
It is important to diagnose and understand what happened here
The text was updated successfully, but these errors were encountered: