Does ipwb handle segmented response records? #374
Comments
@ibnesayeed Offloading the responsibility is one thing; verifying whether ipwb currently does the right thing is another. This ticket is about verifying correctness in ipwb. As an aside, I believe there was an effort to move to warcio at one point, but a difference in the iteration approach it uses kept that from moving forward.
I thought the hiccup was due to the Python version, but I might be wrong.
Yeah, you should use warcio directly for reading the WARC; the latest pywb just uses warcio as well.
@ibnesayeed warcio reports being compatible with Python 2, so this might not have been the issue. Hopefully that will be moot when we finish #51. We discussed utilizing parts of warcio in #129 and #211. @ikreymer Can you report on how warcio handles continuation record(s) chained from a warc-response record?
The WARC/1.1 spec (Section B.8) gives an example where a response record is segmented into multiple smaller records. This changes the hash digests of the records, both in the `WARC-Block-Digest` and `WARC-Payload-Digest` fields of the warc-response and continuation records and in ipwb, which likely calculates the multihash of the content in the initial response record and does not consider the other segments.

Let's check the implementation of the module we are using to extract the warc-response records, using some dummy data that includes `continuation` records and a `WARC-Segment-Number` field in the initial (and potentially subsequent) records.

Ideally, it would be useful to have a data set of WARCs exercising all of the features, a la a set of minimum working examples, but I have yet to come across such a data set. The key here would be MINIMAL examples without the cruft that may trip up other processes and produce true/false positives/negatives.
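
As a starting point for such dummy data, here is a minimal sketch using only the Python standard library. It is not ipwb's or warcio's implementation: the helpers `sha1_b32`, `segment_block`, and `reassemble` are hypothetical names, the records are modelled as plain dicts rather than serialized WARC, and SHA-1 with Base32 is assumed only because that is a commonly seen digest labelling. It shows that a digest computed over the first segment alone differs from one computed over the reassembled block.

```python
import base64
import hashlib
import uuid


def sha1_b32(data):
    """Return a WARC-style digest label (assumed format: sha1:<base32>)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def segment_block(block, size):
    """Split one logical record block into segment dicts (hypothetical model)."""
    origin_id = "<urn:uuid:{}>".format(uuid.uuid4())
    chunks = [block[i:i + size] for i in range(0, len(block), size)]
    segments = []
    for number, chunk in enumerate(chunks, start=1):
        first = number == 1
        headers = {
            # The first segment keeps the original type; the rest are continuations.
            "WARC-Type": "response" if first else "continuation",
            "WARC-Record-ID": origin_id if first else "<urn:uuid:{}>".format(uuid.uuid4()),
            "WARC-Segment-Number": str(number),
            # The block digest covers only this segment's bytes.
            "WARC-Block-Digest": sha1_b32(chunk),
            "Content-Length": str(len(chunk)),
        }
        if not first:
            headers["WARC-Segment-Origin-ID"] = origin_id
            if number == len(chunks):
                # The final continuation reports the length of the whole block.
                headers["WARC-Segment-Total-Length"] = str(len(block))
        segments.append({"headers": headers, "body": chunk})
    return segments


def reassemble(segments):
    """Concatenate segment bodies in WARC-Segment-Number order."""
    ordered = sorted(segments, key=lambda s: int(s["headers"]["WARC-Segment-Number"]))
    return b"".join(s["body"] for s in ordered)


# Dummy HTTP response block, small enough to inspect by hand.
http_block = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n"
    + b"minimal payload " * 8
)
segments = segment_block(http_block, size=64)

first_only = sha1_b32(segments[0]["body"])  # what a segment-unaware indexer would hash
whole = sha1_b32(reassemble(segments))      # digest of the logical record block

print("segments:", len(segments))
print("first segment only:", first_only)
print("reassembled block: ", whole)
```

An indexer that hashes only the initial warc-response record would store `first_only`, which will not match any digest computed over the full content. Serializing these dicts into real WARC records and feeding them to the reader ipwb uses would be the next step in verifying its behavior.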