Fix issues #42 #43 #44 #45 and #47 #46

Closed
wants to merge 23 commits into from
Changes from 22 commits
23 commits
280a975
Fix issue #43
bnfklm Apr 1, 2015
c06aff5
Fix issue #45
bnfklm Apr 1, 2015
4359e51
Fix issue #42 - adding a parameter <file.output>
bnfklm Apr 1, 2015
85902f7
Fix issue #44
bnfklm Apr 1, 2015
18acd5e
Fix issue #47
bnfklm Apr 1, 2015
7b964c6
changing version in pom.xml to 1.1.7-SNAPSHOT
bnfklm Apr 1, 2015
d8b589c
Fix issue #42 : adding a missing import
bnfklm Apr 1, 2015
d5827b3
Extraction from WAT to CDX : correcting RealCDXExtractorOutput.java a…
bnfklm Apr 7, 2015
618fda8
correcting whitespace around assignment operator and renaming 'Entity…
bnfklm Apr 14, 2015
6128ab3
adding changes in CHANGES.md
bnfklm Apr 16, 2015
92fdfd4
putting 'Actual-Content-Lenght into metaData + removing path from fil…
bnfklm Apr 17, 2015
ce23f36
generic defaults values for parameters in commons.properties
bnfklm Apr 21, 2015
4752056
removing Actual-Content-Length and Trailing-Slop-Length from WARC-Met…
bnfklm Apr 21, 2015
d7b7c2a
removing Actual-Content-Lenght and Trailing-Slop-Length from Warc-Met…
bnfklm Apr 22, 2015
8e9cd33
removing default values in common.properties and making them optional
bnfklm Apr 23, 2015
8740b56
adding spaces in comments for consistency
bnfklm May 4, 2015
b1dc2a0
changing 1.1.7 -> 1.1.6 in CHANGES.md
bnfklm May 11, 2015
2dc62c2
adding issue #48 in CHANGES.md
bnfklm May 11, 2015
ba89c3c
Update CHANGES.md
scheylord May 11, 2015
16e0c91
changing getHostName to getCanonicalHostName to conform to Heritrix
bnfklm Jun 24, 2015
c7c1aba
Merge branch '1.1.7-BnF' of github.com:scheylord/webarchive-commons i…
bnfklm Jun 24, 2015
b0fbabd
catching UnknownHostException similar to the Heritrix code
bnfklm Jun 26, 2015
6a0f6dc
correcting typo in log message
bnfklm Jul 3, 2015
9 changes: 9 additions & 0 deletions CHANGES.md
@@ -1,3 +1,12 @@
1.1.6
-----
* [WAT extractor: adding information in WAT's warcinfo](https://github.com/iipc/webarchive-commons/issues/47)
* [WAT extractor: missing WARC format version](https://github.com/iipc/webarchive-commons/issues/45)
* [WAT extractor: envelope structure does not conform to the WAT specification](https://github.com/iipc/webarchive-commons/issues/44)
* [WAT extractor: WARC-Date in all records should be the WAT record generation date](https://github.com/iipc/webarchive-commons/issues/43)
* [WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself](https://github.com/iipc/webarchive-commons/issues/42)
* [WAT extractor: Entity-Trailing-Slop-Bytes should be called Entity-Trailing-Slop-Length](https://github.com/iipc/webarchive-commons/issues/48)

Member:
Could you add a line in here about the change you did for Entity-Trailing-Slop-Bytes should be called Entity-Trailing-Slop-Length (#48)? Thank you!

1.1.5
-----
* [Escape redirect URLs in RealCDXExtractorOutput](https://github.com/iipc/webarchive-commons/pull/36)
2 changes: 1 addition & 1 deletion pom.xml
@@ -9,7 +9,7 @@

<groupId>org.netpreserve.commons</groupId>
<artifactId>webarchive-commons</artifactId>
-<version>1.1.6-SNAPSHOT</version>
+<version>1.1.7-SNAPSHOT</version>
Member:
@anjackson should this still be 1.1.6?

Member:
Yes, the last release was 1.1.5.

<packaging>jar</packaging>

<name>webarchive-commons</name>
@@ -104,7 +104,7 @@ public void output(Resource resource) throws IOException {
String meta = "TBD";
String redir = "TBD";

-if(format.equals("WARC")) {
+if(format.startsWith("WARC")) {
origUrl = getWARCURL(m);
date = getWARCDate(m);
String type = getWARCType(m);
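A minimal sketch of why the comparison is loosened from `equals` to `startsWith`: elsewhere in this diff, `WARCResource` starts writing `WARC/1.0` as the envelope format, which an exact match against `WARC` would no longer accept. The class and method names below are hypothetical; only the two string values come from the diff.

```java
final class EnvelopeFormatCheck {
    // Values mirror ENVELOPE_FORMAT_WARC and ENVELOPE_FORMAT_WARC_1_0 in ResourceConstants.
    static final String FORMAT_WARC = "WARC";
    static final String FORMAT_WARC_1_0 = "WARC/1.0";

    /** Exact match, as before the change: rejects versioned values. */
    static boolean isWarcExact(String format) {
        return FORMAT_WARC.equals(format);
    }

    /** Prefix match, as after the change: accepts "WARC" and "WARC/1.0". */
    static boolean isWarcPrefix(String format) {
        return format != null && format.startsWith(FORMAT_WARC);
    }

    public static void main(String[] args) {
        System.out.println(isWarcExact(FORMAT_WARC_1_0));  // false
        System.out.println(isWarcPrefix(FORMAT_WARC_1_0)); // true
    }
}
```

Note that `"WARC/1.0".startsWith("ARC")` is false, so a prefix check for `ARC` placed before the `WARC` check does not capture WARC records by accident.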
14 changes: 11 additions & 3 deletions src/main/java/org/archive/extract/ResourceExtractor.java
@@ -1,6 +1,7 @@
package org.archive.extract;

import java.io.FileNotFoundException;
+import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
@@ -74,7 +75,7 @@ public int run(String[] args)
if(args.length < 1) {
return USAGE(1);
}
-if(args.length > 3) {
+if(args.length > 4) {
return USAGE(1);
}
int max = Integer.MAX_VALUE;
@@ -89,7 +90,14 @@ public int run(String[] args)
}
}
String path = args[arg];
-if(args.length == arg + 2) {
+String outputFile = null;
+if(args.length >= arg + 2) {
+//if a output file is specified in the command line
Member:
space after "//"

+if(args.length == arg + 3) {
+outputFile = args[arg+2];
+os.close();
+os = new FileOutputStream(outputFile);
+}
if(args[arg].equals("-cdx")) {
path = args[arg+1];
out = new RealCDXExtractorOutput(makePrintWriter(os));
@@ -100,7 +108,7 @@ public int run(String[] args)

} else if(args[arg].equals("-wat")) {
path = args[arg+1];
-out = new WATExtractorOutput(os);
+out = new WATExtractorOutput(os, outputFile);
} else {
String filter = args[arg+1];
out = new JSONViewExtractorOutput(os, filter);
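The optional output-file argument added for issue #42 can be pictured with the sketch below. It is not the PR's parser: `ExtractorArgs` is a hypothetical class that covers only the `-cdx` and `-wat` forms and ignores any leading options and the JSON filter form. It shows one way to keep "no output file" distinct from "output file given", so the caller can fall back to standard output.

```java
final class ExtractorArgs {
    final String mode;       // "-cdx" or "-wat"
    final String path;       // input (W)ARC file
    final String outputFile; // null means "keep writing to standard output"

    private ExtractorArgs(String mode, String path, String outputFile) {
        this.mode = mode;
        this.path = path;
        this.outputFile = outputFile;
    }

    /** Parses "MODE PATH [OUTPUT]", where MODE is -cdx or -wat. */
    static ExtractorArgs parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("usage: (-cdx|-wat) PATH [OUTPUT]");
        }
        String mode = args[0];
        if (!mode.equals("-cdx") && !mode.equals("-wat")) {
            throw new IllegalArgumentException("unknown mode: " + mode);
        }
        // The third argument is optional; null signals that none was given.
        String outputFile = (args.length == 3) ? args[2] : null;
        return new ExtractorArgs(mode, args[1], outputFile);
    }

    public static void main(String[] args) {
        ExtractorArgs parsed = parse(new String[] {"-wat", "in.warc.gz", "out.warc.wat.gz"});
        System.out.println(parsed.mode + " " + parsed.path + " -> " + parsed.outputFile);
    }
}
```

With this shape, the caller would open a `FileOutputStream` only when `outputFile` is non-null and otherwise leave the default stream untouched.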
@@ -68,7 +68,7 @@ public void output(Resource resource) throws IOException {
String date = "TBD";
String canUrl = "TBD";

-if(format.equals("WARC")) {
+if(format.startsWith("WARC")) {
origUrl = getWARCURL(m);
date = getWARCDate(m);
String type = getWARCType(m);
71 changes: 61 additions & 10 deletions src/main/java/org/archive/extract/WATExtractorOutput.java
@@ -2,11 +2,13 @@

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
+import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.text.ParseException;
+import java.net.UnknownHostException;
import java.util.Date;

import org.archive.format.gzip.GZIPMemberWriter;
@@ -22,18 +24,28 @@
import org.archive.util.io.CommitedOutputStream;
import org.json.JSONException;

+import java.net.InetAddress;
+import java.text.DateFormat;
+import java.text.SimpleDateFormat;

+import java.util.logging.Logger;

public class WATExtractorOutput implements ExtractorOutput {
WARCRecordWriter recW;
private boolean wroteFirst;
private GZIPMemberWriter gzW;
private static int DEFAULT_BUFFER_RAM = 1024 * 1024;
private int bufferRAM = DEFAULT_BUFFER_RAM;
private final static Charset UTF8 = Charset.forName("UTF-8");
+private String outputFile;

+private static final Logger LOG = Logger.getLogger(WATExtractorOutput.class.getName());

-public WATExtractorOutput(OutputStream out) {
+public WATExtractorOutput(OutputStream out, String outputFile) {
gzW = new GZIPMemberWriter(out);
recW = new WARCRecordWriter();
wroteFirst = false;
+this.outputFile = outputFile;
}

private CommitedOutputStream getOutput() {
@@ -56,9 +68,9 @@ public void output(Resource resource) throws IOException {
throw new IOException("Missing Envelope.Format");
}
cos = getOutput();
-if(envelopeFormat.equals("ARC")) {
+if(envelopeFormat.startsWith("ARC")) {
writeARC(cos,top);
-} else if(envelopeFormat.equals("WARC")) {
+} else if(envelopeFormat.startsWith("WARC")) {
writeWARC(cos,top);
} else {
// hrm...
@@ -68,13 +80,51 @@ public void output(Resource resource) throws IOException {
}

private void writeWARCInfo(OutputStream recOut, MetaData md) throws IOException {
-String filename = JSONUtils.extractSingle(md, "Container.Filename");
-if(filename == null) {
-throw new IOException("No Container.Filename...");
+// filename is given in the command line
+String filename = outputFile;
+if (filename == null || filename.length() == 0) {
+// if no filename is given on the command line, construct a default filename based on the container filename
+filename = JSONUtils.extractSingle(md, "Container.Filename");
+if (filename == null) {
+throw new IOException("No Container.Filename...");
+}
+if (filename.endsWith(".warc") || filename.endsWith(".warc.gz")) {
+filename = filename.replaceFirst("\\.warc$", ".warc.wat.gz");
+filename = filename.replaceFirst("\\.warc\\.gz$", ".warc.wat.gz");
+} else if (filename.endsWith(".arc") || filename.endsWith(".arc.gz")) {
+filename = filename.replaceFirst("\\.arc$", ".arc.wat.gz");
+filename = filename.replaceFirst("\\.arc\\.gz$", ".arc.wat.gz");
+}
}
+// removing path from filename
+File tmpFile = new File(filename);
+filename = tmpFile.getName();
HttpHeaders headers = new HttpHeaders();
-headers.add("Software-Info", IAUtils.COMMONS_VERSION);
-headers.addDateHeader("Extracted-Date", new Date());
+headers.add("software", IAUtils.COMMONS_VERSION);
+headers.addDateHeader("extractedDate", new Date());
+
+// add ip, hostname
+try {
+InetAddress host = InetAddress.getLocalHost();
+headers.add("ip", host.getHostAddress());
+headers.add("hostname", host.getCanonicalHostName());
+} catch (UnknownHostException e) {
+LOG.warning("unable to obtain local crawl engine host:\n"+e.getMessage());
+}
+
+headers.add("format", IAUtils.WARC_FORMAT);
+headers.add("conformsTo", IAUtils.WARC_FORMAT_CONFORMS_TO);
+// optional arguments
+if(IAUtils.OPERATOR != null && IAUtils.OPERATOR.length() > 0) {
+headers.add("operator", IAUtils.OPERATOR);
+}
+if(IAUtils.PUBLISHER != null && IAUtils.PUBLISHER.length() > 0) {
+headers.add("publisher", IAUtils.PUBLISHER);
+}
+if(IAUtils.WAT_WARCINFO_DESCRIPTION != null && IAUtils.WAT_WARCINFO_DESCRIPTION.length() > 0) {
+headers.add("description", IAUtils.WAT_WARCINFO_DESCRIPTION);
+}
+
ByteArrayOutputStream baos = new ByteArrayOutputStream();
headers.write(baos);
recW.writeWARCInfoRecord(recOut,filename,baos.toByteArray());
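The default-filename rule in `writeWARCInfo` (strip the directory, then turn `.warc`, `.warc.gz`, `.arc` or `.arc.gz` into the matching `.wat.gz` name) can be sketched on its own as follows. This is a single-regex variant written for illustration, not the PR's code; `WatFilenames` and `defaultWatName` are hypothetical names.

```java
import java.io.File;

final class WatFilenames {
    /**
     * Derives a WAT file name from a container file name and drops any
     * directory part, e.g. "/data/a.warc.gz" becomes "a.warc.wat.gz".
     * Names without a (w)arc extension are returned unchanged.
     */
    static String defaultWatName(String containerFilename) {
        String name = new File(containerFilename).getName();
        // Match a trailing ".arc" or ".warc", optionally followed by ".gz",
        // and rewrite it as ".<ext>.wat.gz".
        return name.replaceFirst("\\.(w?arc)(\\.gz)?$", ".$1.wat.gz");
    }

    public static void main(String[] args) {
        System.out.println(defaultWatName("/data/crawl-0001.warc.gz"));
        System.out.println(defaultWatName("crawl-0002.arc"));
    }
}
```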
@@ -105,8 +155,9 @@ private void writeWARC(OutputStream recOut, MetaData md) throws IOException {
} else {
targetURI = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Target-URI");
}
-String capDateString = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Date");
-capDateString = transformWARCDate(capDateString);
+// handle date of generation in WARC format
+DateFormat dateFormat = new SimpleDateFormat("yyyyMMddHHmmss");
+String capDateString = dateFormat.format(new Date());
String recId = extractOrIO(md, "Envelope.WARC-Header-Metadata.WARC-Record-ID");
writeWARCMDRecord(recOut,md,targetURI,capDateString,recId);
}
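For issue #43 the record date becomes the generation time rather than the captured record's `WARC-Date`. The hunk above formats `new Date()` with the pattern `yyyyMMddHHmmss` and does not show a time zone being set; a `SimpleDateFormat` without an explicit zone uses the JVM default. The sketch below assumes UTC is wanted and pins it explicitly. `WatDates` and its methods are hypothetical, and whether `writeWARCMDRecord` expects the 14-digit form or the ISO 8601 form is not shown in this diff.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class WatDates {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    /** Formats a date as a 14-digit UTC timestamp (yyyyMMddHHmmss). */
    static String timestamp14(Date date) {
        // Locale.ROOT keeps the calendar Gregorian and the digits ASCII.
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        format.setTimeZone(UTC);
        return format.format(date);
    }

    /** Formats a date in the WARC-Date style, e.g. 2015-04-01T12:00:00Z. */
    static String warcDate(Date date) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        format.setTimeZone(UTC);
        return format.format(date);
    }

    public static void main(String[] args) {
        Date now = new Date();
        System.out.println(timestamp14(now));
        System.out.println(warcDate(now));
    }
}
```

A new `SimpleDateFormat` is created per call because the class is not thread-safe.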
3 changes: 2 additions & 1 deletion src/main/java/org/archive/resource/ResourceConstants.java
@@ -31,6 +31,7 @@ public interface ResourceConstants {
public static final String ENVELOPE_FORMAT = "Format";
public static final String ENVELOPE_FORMAT_ARC = "ARC";
public static final String ENVELOPE_FORMAT_WARC = "WARC";
+public static final String ENVELOPE_FORMAT_WARC_1_0 = "WARC/1.0";

public static final String WARC_HEADER_LENGTH = "WARC-Header-Length";
public static final String WARC_HEADER_METADATA = "WARC-Header-Metadata";
@@ -104,7 +105,7 @@ public interface ResourceConstants {

public static final String HTTP_ENTITY_LENGTH = "Entity-Length";
public static final String HTTP_ENTITY_DIGEST = "Entity-Digest";
-public static final String HTTP_ENTITY_TRAILING_SLOP = "Entity-Trailing-Slop-Bytes";
+public static final String HTTP_ENTITY_TRAILING_SLOP = "Entity-Trailing-Slop-Length";

public static final String HTML_METADATA = "HTML-Metadata";
public static final String HTML_HEAD = "Head";
8 changes: 5 additions & 3 deletions src/main/java/org/archive/resource/warc/WARCResource.java
@@ -36,7 +36,7 @@ public WARCResource(MetaData metaData, ResourceContainer container,
this.response = response;

long length = -1;
-metaData.putString(ENVELOPE_FORMAT, ENVELOPE_FORMAT_WARC);
+metaData.putString(ENVELOPE_FORMAT, ENVELOPE_FORMAT_WARC_1_0);
metaData.putLong(WARC_HEADER_LENGTH, response.getHeaderBytes());
MetaData fields = metaData.createChild(WARC_HEADER_METADATA);
for(HttpHeader h : response.getHeaders()) {
@@ -68,11 +68,11 @@ public InputStream getInputStream() {
}

public void notifyEOF() throws IOException {
-envelope.putLong(PAYLOAD_LENGTH, countingIS.getCount());
String digString = Base32.encode(digIS.getMessageDigest().digest());
-envelope.putString(PAYLOAD_DIGEST, "sha1:"+digString);
if(container.isCompressed()) {
+metaData.putLong(PAYLOAD_LENGTH, countingIS.getCount());
metaData.putLong(PAYLOAD_SLOP_BYTES, StreamCopy.readToEOF(response));
+metaData.putString(PAYLOAD_DIGEST, "sha1:"+digString);
} else {
// consume trailing bytes if we can...
InputStream raw = response.getInner();
@@ -81,7 +81,9 @@ public void notifyEOF() throws IOException {
(PushBackOneByteInputStream) raw;
long numNewlines = StreamCopy.skipChars(pb1bis, CR_NL_CHARS);
if(numNewlines > 0) {
+metaData.putLong(PAYLOAD_LENGTH, countingIS.getCount());
metaData.putLong(PAYLOAD_SLOP_BYTES, numNewlines);
+metaData.putString(PAYLOAD_DIGEST, "sha1:"+digString);
}
}
}
@@ -33,8 +33,8 @@ public Resource getResource(InputStream is, MetaData parentMetaData,
if(headers.isCorrupt()) {
md.putBoolean(WARC_META_FIELDS_CORRUPT, true);
}
-md.putLong(PAYLOAD_SLOP_BYTES, StreamCopy.readToEOF(is));
-md.putLong(PAYLOAD_LENGTH, bytes);
+parentMetaData.putLong(PAYLOAD_SLOP_BYTES, StreamCopy.readToEOF(is));
+parentMetaData.putLong(PAYLOAD_LENGTH, bytes);
return new WARCMetaDataResource(md,container, headers);

} catch (HttpParseException e) {
33 changes: 33 additions & 0 deletions src/main/java/org/archive/util/IAUtils.java
@@ -24,7 +24,10 @@
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.Properties;

/**
* Miscellaneous useful methods.
@@ -35,6 +38,11 @@ public class IAUtils {
public final static Charset UTF8 = Charset.forName("utf-8");

final public static String COMMONS_VERSION = loadCommonsVersion();
final public static String PUBLISHER = loadCommons("publisher");
final public static String OPERATOR = loadCommons("operator");
final public static String WAT_WARCINFO_DESCRIPTION = loadCommons("wat.warcinfo.description");
final public static String WARC_FORMAT = loadCommons("warc.format");
final public static String WARC_FORMAT_CONFORMS_TO = loadCommons("warc.format.conforms.to");

public static String loadCommonsVersion() {
InputStream input = IAUtils.class.getResourceAsStream(
@@ -57,6 +65,31 @@ public static String loadCommonsVersion() {
return version.trim();
}

public static String loadCommons(String id) {
InputStream input = IAUtils.class.getResourceAsStream("/org/archive/commons.properties");
Reader reader = null;
if (input == null) {
return "UNKNOWN";
}
try {
reader = new InputStreamReader(input, "UTF-8");
} catch (UnsupportedEncodingException e) {
return "UNKNOWN";
}
Properties prop = new Properties();
try {
prop.load(reader);
} catch (IOException e1) {
return "UNKNOWN";
}
if (prop.getProperty(id) != null) {
return prop.getProperty(id);
} else {
return "UNKNOWN";
}

}

public static void closeQuietly(Object input) {
if(input == null || ! (input instanceof Closeable)) {
return;
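In the hunk above, `loadCommons` re-reads `commons.properties` for every key and returns the string `"UNKNOWN"` on failure, while the callers in `WATExtractorOutput` skip a field only when the value is null or empty. A minimal sketch of an alternative, assuming blank values should mean "field not set": load the properties once and map blank entries to null. `OptionalProps`, `load` and `valueOrNull` are hypothetical names, and the sketch reads from any `Reader` so that it can run without the classpath resource.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

final class OptionalProps {
    /** Loads properties from the reader and closes it afterwards. */
    static Properties load(Reader reader) throws IOException {
        Properties props = new Properties();
        try (Reader r = reader) {
            props.load(r);
        }
        return props;
    }

    /** Returns the trimmed value for key, or null when it is missing or blank. */
    static String valueOrNull(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null) {
            return null;
        }
        value = value.trim();
        return value.isEmpty() ? null : value;
    }

    public static void main(String[] args) throws IOException {
        Properties props = load(new StringReader("operator=\nwarc.format=WARC File Format 1.0\n"));
        System.out.println(valueOrNull(props, "operator"));    // prints "null"
        System.out.println(valueOrNull(props, "warc.format")); // prints "WARC File Format 1.0"
    }
}
```

With this shape, an optional warcinfo field would be written only when `valueOrNull` returns a non-null value.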
5 changes: 5 additions & 0 deletions src/main/resources/org/archive/commons.properties
@@ -0,0 +1,5 @@
operator=
publisher=
wat.warcinfo.description=
warc.format=WARC File Format 1.0
warc.format.conforms.to=http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf