Improve the quality of article content indexed by xapian #1725
Comments
I agree with the principle of adding metatag information to opt parts of the HTML text out of indexing. I'm not sure this is the right thing to do for all the examples you have given, but it definitely sounds like a good approach. Give it a try!
From @automactic on June 19, 2016 20:48
What code should I modify to add comments to the HTML strings? Also, do you think adding comments to the HTML strings will increase the size of the ZIM files?
I've implemented a few things in commits openzim/zimwriterfs@79921c8 and openzim/zimwriterfs@7302f47 (not merged). However, it seems that the HTML span class names change depending on the language of the article.
Do not create a new ofstream for each part.
Not sure if this is still relevant, but having stumbled across this issue while looking for something else, I'd just like to point out the value of indexing references (footnotes/endnotes) from an academic perspective. If I am searching for information about some obscure historical figure, it would be very valuable to be able to find quickly a bibliographical resource on that person, say, in an article that might be about some other event I would never have thought to look under.
libzim now provides an IndexData interface that lets scrapers supply the data they want to be indexed.
@mgautierfr Mostly Wikipedia/Mediawiki stuff. Moving to
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
From @automactic on June 14, 2016 21:36
Problem:
In the current xapian indexing process, the article content extracted by omega contains a lot of useless information, such as the references section, the legal footnote, and the inline references.
Desired Output:
A clean string of article content, without
Example:
The "apple juice" article in wikipedia_en_simple_all_2016-05.zim
Here is the info extracted by omega html parser and passed to xapian for indexing:
Possible Solution:
Add UdmComment markup to comment out parts of the HTML, so that the omega HTML parser ignores them. (source)
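To make the idea concrete, here is a rough Python sketch of the wrapping step. It is not the zimwriterfs implementation. The marker strings `<!--UdmComment-->` and `<!--/UdmComment-->` are the ones the omega HTML parser is documented to skip; the function name `mark_noindex` and the `<ol class="references">` pattern are assumptions made for illustration only, and the real class names may differ per language.

```python
import re

# Marker pair that the omega HTML parser skips when extracting text.
NOINDEX_OPEN = "<!--UdmComment-->"
NOINDEX_CLOSE = "<!--/UdmComment-->"

# Hypothetical pattern: assumes references are rendered as
# <ol class="references">...</ol> with no nested <ol> inside.
REFERENCES_RE = re.compile(r'<ol class="references">.*?</ol>', re.DOTALL)


def mark_noindex(html, pattern=REFERENCES_RE):
    """Wrap every match of `pattern` in UdmComment markers.

    Returns (new_html, added_bytes), where added_bytes is the size of the
    inserted markers in UTF-8, before any compression.
    """
    marked, count = pattern.subn(
        lambda m: NOINDEX_OPEN + m.group(0) + NOINDEX_CLOSE, html
    )
    added_bytes = count * len((NOINDEX_OPEN + NOINDEX_CLOSE).encode("utf-8"))
    return marked, added_bytes


article = (
    "<p>Apple juice is a drink.</p>"
    '<ol class="references"><li>Some source</li></ol>'
)
marked, added_bytes = mark_noindex(article)
print(marked)
print(added_bytes)
```

On the size question: each wrapped section adds one marker pair, which is 35 bytes of uncompressed HTML, so the overhead scales with the number of wrapped sections rather than with article length. A regular expression is used here only to keep the sketch short; a real HTML parser would be more robust against nested lists.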
Copied from original issue: kiwix/kiwix-xulrunner#244