Remove HTML parsing from our source repository #377
Ideally, we shouldn't parse the HTML at all in libzim. The new writer API will allow user code to provide its own content to index, instead of letting libzim use the main HTML content.
@mgautierfr I'm in favour of keeping libzim as lean as possible. We are slowly realising (with Zimit) that libzim users need a bit more freedom over full-text index creation. Your last comment therefore goes, IMO, in the right direction. But things still need to be easy to use, so we need to think twice about the alternative if document parsing is no longer in libzim at some point. What is certain is that this is a different topic/ticket from this one, and I would like to see a ticket describing the problem we want to fix, and a discussion about it, before we change this in libzim.
This is related, as I'm not sure we should invest in HTML parsing (finding a new tool, updating our code) if we plan to remove it and keep it only for compatibility.
openzim/mwoffliner#1725 seems good for that.
I was planning to do this in the libzim_next API change.
@mgautierfr #325 talks about that clearly indeed:
But there is no clear explanation of either the problem or the solution. Same in #364, where it is not clear at all if
Therefore I want to first have a clear description of the problem and the solution, and an agreement on this topic, if you want to follow that path.
How
As I said, I will keep the HTML parsing for compatibility. It will be the default implementation of
Now, with libzim7, we can specify content to index which is different from the content of the article. But this does not resolve the primary purpose of this ticket.
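To make the idea above concrete, here is a minimal sketch of "content to index that differs from the article content", with the built-in parsing kept only as a fallback. This is not libzim's API: `Entry`, `index_text`, `FallbackExtractor` and `text_to_index` are hypothetical names invented for illustration, and the sketch assumes C++17.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical entry type (not a libzim class).
struct Entry {
    std::string path;
    std::string content;                    // what the reader is served
    std::optional<std::string> index_text;  // what the indexer should see, if the caller supplies it
};

// Hypothetical fallback: stands in for a default extractor (e.g. HTML parsing)
// that is kept only for compatibility.
using FallbackExtractor = std::function<std::string(const std::string&)>;

// Prefer the caller-supplied text; only parse the content when none was given.
std::string text_to_index(const Entry& entry, const FallbackExtractor& fallback) {
    assert(fallback != nullptr);  // a fallback extractor must always be provided
    if (entry.index_text) {
        return *entry.index_text;
    }
    return fallback(entry.content);
}
```

With this shape, a caller that already knows what should be indexed never triggers the parser; only entries without `index_text` go through the compatibility path.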
The Xapian team has made progress on their side and even if we still don't have a
Besides resolving this ticket, relying on an external
One important point is that we actually need to retrieve the token from
I would like to assess now whether we can move forward on this, which concretely means (1) sponsoring the creation of a
Incidentally, there's now support for EPUB (using libgepub which is maintained by GNOME: https://gitlab.gnome.org/GNOME/libgepub).
Not quite sure what you mean by "retrieve the token", but I think you're wanting to get the text that's extracted rather than having it indexed for you, right? So more of a "libomextracttext" rather than "libomindex".
The current internal interface for the worker subprocess extractors is that you pass in a filename (I understand you'd want to be able to pass a buffer instead) and a MIME content-type string, and get out separate text strings for the document body, title, keywords, author (or from for email), to, cc, bcc, message-id, an integer for number of pages, and a
The HTML parser has probably evolved a bit since you took a copy, but can be passed a buffer to give a subset of those metadata items. If you're using a custom subclass that still has the same basic approach of calling a method for each open tag, close tag, and content between tags.
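For readers unfamiliar with the approach described above (one method call per open tag, close tag, and run of content between tags, fed from an in-memory buffer), here is a rough sketch. It is neither Xapian's parser nor the copy in libzim: `TagParser`, `TextExtractor`, `ExtractedText` and `extract_text` are hypothetical names, and the tokenizer is deliberately naive (no entity decoding, no handling of `>` inside comments or attribute values, or of `<` inside scripts).

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical callback-style parser: one virtual method per event.
class TagParser {
public:
    virtual ~TagParser() = default;
    void parse(const std::string& html);  // works on an in-memory buffer
protected:
    virtual void opening_tag(const std::string& name) = 0;
    virtual void closing_tag(const std::string& name) = 0;
    virtual void process_text(const std::string& text) = 0;
};

void TagParser::parse(const std::string& html) {
    std::size_t pos = 0;
    while (pos < html.size()) {
        std::size_t lt = html.find('<', pos);
        if (lt == std::string::npos) lt = html.size();
        if (lt > pos) process_text(html.substr(pos, lt - pos));
        if (lt == html.size()) break;
        std::size_t gt = html.find('>', lt);
        if (gt == std::string::npos) break;  // unterminated tag: stop parsing
        assert(gt > lt);                     // '>' was searched for after '<'
        const bool closing = (lt + 1 < gt && html[lt + 1] == '/');
        std::string name;  // lower-cased tag name, attributes ignored
        for (std::size_t i = lt + (closing ? 2 : 1); i < gt; ++i) {
            const unsigned char c = static_cast<unsigned char>(html[i]);
            if (!std::isalnum(c)) break;
            name += static_cast<char>(std::tolower(c));
        }
        if (!name.empty()) {  // comments and doctypes yield no name and are skipped
            if (closing) closing_tag(name); else opening_tag(name);
        }
        pos = gt + 1;
    }
}

// A subset of the metadata items mentioned above.
struct ExtractedText {
    std::string title;
    std::string body;
};

// Custom subclass: collects title and body text, dropping script/style content.
class TextExtractor : public TagParser {
public:
    ExtractedText result;
protected:
    void opening_tag(const std::string& name) override {
        if (name == "title") in_title_ = true;
        else if (name == "script" || name == "style") ++skip_depth_;
    }
    void closing_tag(const std::string& name) override {
        if (name == "title") in_title_ = false;
        else if ((name == "script" || name == "style") && skip_depth_ > 0) --skip_depth_;
    }
    void process_text(const std::string& text) override {
        if (skip_depth_ > 0) return;
        (in_title_ ? result.title : result.body) += text;
    }
private:
    bool in_title_ = false;
    int skip_depth_ = 0;
};

ExtractedText extract_text(const std::string& html) {
    TextExtractor extractor;
    extractor.parse(html);
    return extractor.result;
}
```

The point of the design is that the tokenizer knows nothing about indexing: a different subclass could collect keywords or an author field instead, without touching `parse`.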
Yes, the text usable directly by the indexer (without HTML tags, for example).
I've been looking at what needs doing, and have a few further questions. I understand HTML is your primary format of interest, and both PDF and EPUB have been mentioned too. Are there other formats that are of enough interest to you that I should be aware of them at this point? (The formats omindex already has workers for should generally just work, but there are some formats that are extracted by running an external program (e.g.
You want to be able to pass the data in a buffer rather than in a file. Would it work for you if that buffer was required to be allocated by "libomindex"? If the library does the allocation it can be shared with the worker via
Are you interested in metadata (title, author, etc) and if so what items? We're limited in some cases by what the library we're using supports, but knowing what's of interest means I can try to ensure those are extracted when possible.
Since the very beginning, almost 15 years ago, we have parsed HTML documents in the writer using code based on Xapian's omindex.
Unfortunately, because omindex was and still is a command-line tool, I had to copy the interesting part of omindex's code and put it in our repo. I had to patch it slightly, but I don't believe there is anything really necessary (anymore?) in those changes.
This is a bit of a problem because:
At the core of the problem is that Xapian does not provide this feature as a library. There is a ticket on their side about it. The ticket is old and, surprisingly, it does not seem to be a priority. We might have to consider helping Xapian on this.