WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself #42

Closed
saraaubry opened this issue Apr 1, 2015 · 0 comments

In the current implementation of the WAT extractor, the WARC-Filename in the WAT warcinfo record is the filename of the original (W)ARC file.
According to the WARC ISO standard, it should be the WAT filename itself.

Current:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-02-18T10:24:54Z
WARC-Filename: BnF-6224-50-20150218094547-00001-ciblee_2015_menelas2.bnf.fr.warc.gz
WARC-Record-ID: urn:uuid:97a37ea9-1af4-4c47-8ae0-5515428347aa
Content-Type: application/warc-fields
Content-Length: 73

Target:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-02-18T10:24:54Z
WARC-Filename: BnF-6224-50-20150218094547-00001-ciblee_2015_menelas2.bnf.fr.warc.wat.gz
WARC-Record-ID: urn:uuid:97a37ea9-1af4-4c47-8ae0-5515428347aa
Content-Type: application/warc-fields
Content-Length: 73

Implementation:
java -jar extractor.jar -wat fichierA.warc.gz --> output goes to standard output
WARC-Filename:
fichierA.warc.gz => fichierA.warc.wat.gz
fichierA.arc.gz => fichierA.arc.wat.gz
fichierA.warc => fichierA.warc.wat
fichierA.arc => fichierA.arc.wat

java -jar extractor.jar -wat fichierA.warc.gz fichierB.wat.warc.gz --> output goes to the file fichierB.wat.warc.gz
WARC-Filename: fichierB.wat.warc.gz
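
The naming rules above could be sketched in Java roughly as follows. This is only an illustration of the proposed mapping, not the extractor's actual code: the class name `WatFilenames` and both method names are hypothetical, and stripping any directory part from the path is an assumption that the issue does not state.

```java
import java.nio.file.Paths;

// Hypothetical helper illustrating the proposed WARC-Filename rules.
final class WatFilenames {
    private WatFilenames() {}

    // Output on standard output: derive the WAT name from the input (W)ARC name.
    // fichierA.warc.gz -> fichierA.warc.wat.gz, fichierA.arc -> fichierA.arc.wat
    static String fromInput(String inputName) {
        // Assumption: WARC-Filename carries the bare file name, without directories.
        String base = Paths.get(inputName).getFileName().toString();
        if (base.endsWith(".gz")) {
            // Insert ".wat" before the compression suffix.
            return base.substring(0, base.length() - ".gz".length()) + ".wat.gz";
        }
        return base + ".wat";
    }

    // Explicit output file: its own name is used as WARC-Filename.
    // outputName is null when the extractor writes to standard output.
    static String warcFilename(String inputName, String outputName) {
        if (outputName != null) {
            return Paths.get(outputName).getFileName().toString();
        }
        return fromInput(inputName);
    }

    public static void main(String[] args) {
        System.out.println(warcFilename("fichierA.warc.gz", null));
        System.out.println(warcFilename("fichierA.warc.gz", "fichierB.wat.warc.gz"));
    }
}
```

With this sketch, the value written to the warcinfo record never depends on the input name once an output file is given, which matches the second case above.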
