Improve the quality of article content indexed by xapian #1725

kelson42 · 2017-05-14T16:44:02Z

From @automactic on June 14, 2016 21:36

Problem:

In the current Xapian indexing process, the article content extracted by omega contains a lot of useless information, such as the reference section, the legal footnote, and inline references.

Desired Output:

A clean string of article content, without

title
hatnote, aka "xxx is not to be confused with"
inline reference, such as "[1]"
reference section
legal footnote: "this article is issued from......the media files."
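
As a rough illustration of two of the removals listed above (inline references and the legal footnote), here is a minimal sketch that works on already-extracted plain text. The regular expressions and the function name `clean_content` are assumptions based only on the example in this issue. Real dumps may word the footer differently, and title, hatnote and reference-section removal would need HTML structure rather than text patterns.

```python
import re

# Assumed patterns, derived from the "apple juice" example in this issue.
INLINE_REF = re.compile(r"\[\d+\]")  # e.g. "[1]"
LEGAL_FOOTER = re.compile(
    r"this article is issued from .*$", re.IGNORECASE | re.DOTALL
)


def clean_content(text: str) -> str:
    """Drop the legal footer and inline reference markers, then tidy whitespace."""
    text = LEGAL_FOOTER.sub("", text)
    text = INLINE_REF.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Apples and cider are mentioned a lot.[1] "
    "This article is issued from Wikipedia."
)
print(clean_content(sample))
```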

Example:

The "apple juice" article in wikipedia_en_simple_all_2016-05.zim
Here is the info extracted by omega html parser and passed to xapian for indexing:

Title:Apple juice
Keywords:
Snippet:Apple juice Apple juice Not to be confused with cider. Apple juice is the juice from apples. It does not have alcohol, and it tastes sweet from the natural fruit sugars. Many companies making apple juice like to say that they do not add more sugar into the drink, and there is only natural sugar
Content:apple juice apple juice not to be confused with cider. apple juice is the juice from apples. it does not have alcohol, and it tastes sweet from the natural fruit sugars. many companies making apple juice like to say that they do not add more sugar into the drink, and there is only natural sugar. origin the apple tree came from the same era as elizabethan in the late 1500's and early 1600's (pyrus malus), and is native to britain. even in the old saxon papers, apples and cider are mentioned a lot.[1] the fruit is thought to have come in the caucasus, a place with many mountains between the black and caspian seas.[1] the lady apple, a kind of apple still grown today, is believed to be one of the oldest apple trees on record. healthiness it is remarkable how closely the history of the apple tree is connected with that of man. —henry david thoreau in both facts and stories, the apple appears to be very healthy. there are two types of apple juice. one is the clear apple juice, and the other is the cloudy apple juice. pectin and starch are taken out during the production process to produce clear apple juice. cloudy apple juice is cloudy because of evenly-distributed small pulp suspensions in the juice concentrate.[1] also, in apple juice, the vitamin c, and other vitamins are contained inside, as well as mineral nutrients such as boron which helps build strong bones. research from the university of massachusetts lowell shows that apple juice also increases acetylcholine in the brain, which gets you increased memory. apples can also be a main source of fiber, and is a powerful cleanser and an important necessity for the health of your body.[2] the compounds in apple juice called phytonutrients delay the break down of ldl or cholesterol. in history, the phrase from benjamin franklin "an apple a day keeps the doctor away" is very famous. new research is proving this phrase to be a fact. 
researchers at uc davis school of medicine have recently found out that drinking apple juice seems to slow down the process that may lead to heart disease. researchers at the university of groningen in the netherlands had studied and found that smokers who ate many fruits and vegetables, especially apples, had reduced their risk of getting the common diseases smokers would get. the risk was reduced by 50%.[2] for older people, drinking fruit juices should begin with apples, especially if they are suffering from arthritis and rheumatism. this is because apples carry a substantial amount of potassium. because of this, eating apples or apple juice has been known to help. drinking apple juice also removes some toxins from the liver and kidneys and is low in calories. over time, this can reduce the chances of having liver or kidney disease.[2] use apple juice can be used to make cider and calvados. some types of cider and all types of calvados contain alcohol. production addressed as one of the most popular fruits in the world, the apple is cultivated in around 7,500 different kinds in shape, color, texture, firmness, crispness, acidity, juiciness, sweetness, nutrition, and harvesting time.[1] references 1 2 3 4 "apple juice". agriculturalproductsindia.com. http://www.agriculturalproductsindia.com/beverages-juices/beverages-juices-apple-juice.html. retrieved 28 april 2010. 1 2 3 "apple juice". soymilkquick.com. http://www.soymilkquick.com/applejuice.php. retrieved 28 april 2010. this article is issued from wikipedia - version of the tuesday, april 26, 2016. the text is available under the creative commons attribution/share alike but additional terms may apply for the media files.

Possible Solution:

Add UdmComment markup to comment out parts of the HTML, so that the omega HTML parser ignores them. (source)
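
A minimal sketch of what this markup could look like when applied at write time. It assumes the marker pair `<!--UdmComment-->` / `<!--/UdmComment-->`, which is the form the omega documentation describes. The helper name `exclude_from_index` and the sample HTML are hypothetical.

```python
# Assumed marker pair; text between the two is meant to be skipped by omega's HTML parser.
UDM_OPEN = "<!--UdmComment-->"
UDM_CLOSE = "<!--/UdmComment-->"


def exclude_from_index(fragment: str) -> str:
    """Wrap an HTML fragment so that the indexer is asked to ignore it."""
    return f"{UDM_OPEN}{fragment}{UDM_CLOSE}"


body = "<p>Apple juice is the juice from apples.</p>"
references = '<ol class="references"><li>"Apple juice". Retrieved 28 April 2010.</li></ol>'
page = body + exclude_from_index(references)

# Overhead added per wrapped region, in characters, before any compression.
overhead = len(UDM_OPEN) + len(UDM_CLOSE)
print(page)
print(overhead)
```

The markers are plain HTML comments, so they should not change how the page renders. The cost is a small fixed number of characters per wrapped region.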

Copied from original issue: kiwix/kiwix-xulrunner#244

kelson42 · 2017-05-14T16:44:03Z

I agree with the principle of adding metatag information to opt parts of the HTML text out of indexing. I'm not sure this is the right thing to do for all the examples you have given, but it definitely sounds like a good approach. Give it a try!

kelson42 · 2017-05-14T16:44:04Z

From @automactic on June 19, 2016 20:48

What code should I modify to add comments to html strings? Also, do you think adding comments to html string will increase the size of zim files?

mgautierfr · 2017-07-19T16:21:22Z

I've implemented a few things in commits openzim/zimwriterfs@79921c8 and openzim/zimwriterfs@7302f47 (not merged).
At indexing time, it tries to remove spans containing references or backlinks.

However, it seems that the HTML span class names change depending on the language of the article.
For example, in English it is mw-cite-backlink and in French it is reference-text.
It seems pretty complicated to implement this correctly in libzim or zimwriterfs.
Maybe mwoffliner should parse the HTML and tag which content should be indexed and which should not.
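
To make the difficulty concrete, here is a small sketch of class-based filtering using only Python's standard `html.parser`. It is not the code from the commits above. The per-language mapping `SKIP_CLASSES` is an assumption built from the two class names quoted in this comment, and the approach assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Assumed mapping; only these two class names come from this thread.
SKIP_CLASSES = {"en": {"mw-cite-backlink"}, "fr": {"reference-text"}}
# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexTextExtractor(HTMLParser):
    def __init__(self, skip_classes):
        super().__init__()
        self.skip_classes = skip_classes
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a skipped one
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & self.skip_classes:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html, lang):
    """Return the text of `html`, minus elements whose class is skipped for `lang`."""
    parser = IndexTextExtractor(SKIP_CLASSES.get(lang, set()))
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


html = '<p>Apples.<span class="mw-cite-backlink">^ <a href="#x">a</a></span> Cider.</p>'
print(extract_text(html, "en"))
```

The table has to be maintained per language and per wiki skin, which is the maintenance burden described above and one reason to push the tagging to the scraper.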

Jaifroid · 2023-01-09T02:13:10Z

Not sure if this is still relevant, but having stumbled across this issue while looking for something else, I'd just like to point out the value of indexing references (footnotes/endnotes) from an academic perspective. If I am searching for information about some obscure historical figure, it would be very valuable to be able to find quickly a bibliographical resource on that person, say, in an article that might be about some other event I would never have thought to look under.

mgautierfr · 2023-01-11T15:49:37Z

libzim now provides an IndexData interface that lets the scraper supply the data it wants indexed.
We can either close this issue or move it to the scraper side (mwoffliner?).
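
A rough illustration of the idea of scraper-supplied index data. The class below is a hypothetical plain-Python stand-in and not the libzim API: the class name, constructor and `index_references` flag are all assumptions. The flag is there because whether references belong in the index is a policy choice, as the earlier comment about bibliographical searches shows.

```python
class CuratedIndexData:
    """Hypothetical stand-in for index data handed over by a scraper."""

    def __init__(self, title, body, references="", index_references=False):
        self.title = title
        self.body = body
        self.references = references
        self.index_references = index_references  # policy knob (assumption)

    def get_title(self):
        return self.title

    def get_content(self):
        """Return only the text the scraper wants to be searchable."""
        parts = [self.body]
        if self.index_references and self.references:
            parts.append(self.references)
        return " ".join(parts)


data = CuratedIndexData(
    title="Apple juice",
    body="Apple juice is the juice from apples.",
    references='"Apple juice". Retrieved 28 April 2010.',
)
print(data.get_content())
```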

kelson42 · 2023-01-11T16:39:55Z

@mgautierfr Mostly Wikipedia/Mediawiki stuff. Moving to openzim/mwoffliner.

stale · 2023-05-26T18:32:41Z

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 added the enhancement label May 14, 2017

kelson42 mentioned this issue May 14, 2017

Improve the quality of article content indexed by xapian kiwix/kiwix-xulrunner#244

Closed

kelson42 referenced this issue in openzim/libzim Mar 30, 2020

Merge pull request #17 from openzim/fix_use_of_swap

30b90fb

Do not use create a new ofstream for each part.

mgautierfr mentioned this issue Jul 23, 2020

Remove HTML parsing from our source repository openzim/libzim#377

Open

kelson42 transferred this issue from openzim/libzim Jan 11, 2023

Jaifroid mentioned this issue Mar 14, 2023

Use libzim IndexData::getContent to provide currated content to index. #1810

Closed

stale bot added the stale label May 26, 2023
