Ensure loading of recent public suffix list (effective_tld_names.dat) #17

Closed
sebastian-nagel opened this issue Feb 14, 2020 · 1 comment
@sebastian-nagel

The public suffix list (using the old file name "effective_tld_names.dat") is shipped twice in the Nutch job file in the dependency jar files of

The latter one ships with a heavily outdated version of the public suffix list. Crawler-commons EffectiveTldFinder loads "effective_tld_names.dat" from the class path. When running in distributed mode, there is no control over which dependency jar comes first on the class path, so the outdated version may be loaded.

Ideally, the most recent version of the public suffix list should be used. This could be achieved by downloading the list during build and placing it in the "conf/" folder which is always first in the class path.
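
To check which copy would win on a given class path, the following minimal sketch lists every location where "effective_tld_names.dat" is visible to the class loader. The class name `SuffixListLocator` and the method `locate` are hypothetical names introduced for illustration; they are not part of Nutch or crawler-commons. The sketch also assumes that the order returned by `ClassLoader.getResources` matches the lookup order used when the file is actually loaded, which is typically the case but depends on the class loader.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic helper, not part of Nutch or crawler-commons. */
public class SuffixListLocator {

    /** Returns all class path locations of the named resource, in class loader order. */
    static List<URL> locate(String resourceName) throws IOException {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // Fall back to the loader of this class if no context loader is set.
            loader = SuffixListLocator.class.getClassLoader();
        }
        List<URL> locations = new ArrayList<>();
        Enumeration<URL> found = loader.getResources(resourceName);
        while (found.hasMoreElements()) {
            locations.add(found.nextElement());
        }
        return locations;
    }

    public static void main(String[] args) throws IOException {
        List<URL> copies = locate("effective_tld_names.dat");
        if (copies.isEmpty()) {
            System.out.println("No copy of effective_tld_names.dat on the class path");
            return;
        }
        // Assumption: the first entry is the one a plain resource lookup would return.
        System.out.println("Likely loaded: " + copies.get(0));
        for (int i = 1; i < copies.size(); i++) {
            System.out.println("Shadowed:      " + copies.get(i));
        }
    }
}
```

Run with the same class path as the job (for example with "conf/" placed first), more than one line of output indicates that several copies are present and that the order matters.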

@sebastian-nagel
Author

Solved by placing a recent public suffix list in the conf/ folder, e.g. via

wget -O conf/effective_tld_names.dat https://publicsuffix.org/list/public_suffix_list.dat

The solution is implemented in the crawl machine setup scripts.
