-
-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plan to libzim2 #325
Comments
ping @kelson42, @julianharty, @veloman-yunkan, @jetownfeve21, @mossroy, @Jaifroid, @dnohales, @automactic, @MiguelRocha, @rgaudin Please add any other that may be interested. |
My personal wishes: :)
|
Thanks for the detailed overview of the plan. Can you elaborate on the use cases you see for the Entry categories? Trying to understand if it's targeting internal/technical use ( |
Thanks for sharing the plan with us. Additionally, I had to patch a little bit libzim and kiwix-lib to be able to compile them with emscripten. I did that in a very ugly way. Ideally, these patches would be applied upstream (in a cleaner way). And compiling with emscripten would be integrated in the regular build process. See kiwix/kiwix-js#508 |
I agree with @mossroy about the way forward for Kiwix JS, and the need for help from someone knowledgeable in compiling C++ (ideally with Emscripten). The block at our end is not for want of trying. More generally, however, I think it would be very important to document fully the proposed ZIM file format specification before beginning work on these breaking changes. Some things that at least extend the spec have been added without documenting them, most notably the search index encoded in a format that isn't mentioned on openZIM (and that breaks if the ZIM is split...). Please note that we currently rely, for search, on the title pointer list (we cannot read the compressed search index). I don't see anything above that threatens that. If that is correct, then it may be possible to adapt our reader if we are still unable to compile libzim to wasm, but compiling libzim is by far preferable. |
They are good wishes :) But a bit out of scope for now. (And I don't understand the third-party containers point)
Maybe we can help (depending of our priorities). While it is probably doable for libzim itself, I am a bit affraid of the dependencies compilation (icu4c, xapian, ...)
Yes, and it will be. But there is not a lot of change about the format for namespace, categories and extradata. None of them is breaking the format. If you can read a current zim file, you should be able to read a new one (without the new features). For the templating system, yes you will have to adapt as you need to render the article.
The search index is a xapian database. It was already the case before but it was in the If something is missing in the spec, please open a specific issue. |
I managed to compile both libzim and kiwix-lib with emscripten (see Makefile of https://github.com/mossroy/libzim_wasm/), but in a quick-and-dirty way. It needed to compile these dependencies too. |
I would really prefer to get things based on a minor version udpate of the ZIM format
I think yes. I have a bit difficulties how this looks like, in particular how we display things? The label of the category? The pagination of categories having 100.000 entries? The layout?
I don't really get that part. You mean store it in the |
Indeed
The dirent can store extra parameter (called parameter in the spec). We don't use it for now.
The category's content will be a binary content stored in a blob. It will be a list of article indexes. The question (several categories per article) is more about to find in which categories is an article. [1] In which cluster we are storing the extradata is an open question. If extradata are use to get information about an article without always acceding the article (listing, search, ...) it is better to use a specific cluster (potentially not compressed) for the extradata. If they are use in the same time as the article, it is better to store them in the same cluster of the article. I'm not sure we can decide that on the libzim side. |
Implementation has started |
Yes, the title is a lie.
libzim is already at version 6, we won't rename
libzim
tolibzim2
and zim format may not change.But we are discussing important changes in the libzim api and how we interpret the zim format.
This plan comes from several needs :
The plan is to pave the way for this changes.
Not everything will be made at once, but we should discuss this in the same time.
Removing namespace
Totally removing the namespace from the zim format is a pretty complex things,
it means a total zim file format change, and a rewriting a a big part of the library.
On top of that, removing them leads to new questions (How to store the metadata ?)
So the idea is to keep the namespaces but use them differently :
Creator
The creator will have different methods :
It will simplify the user code as it doesn't need to create an article for
metadata and redirect.
The "classical" article will not need to have
getNamespace
method and thepath will not contains it.
The namespaces will still be used :
'M' and 'X' for metadata and indexes (as before)
All other content will go in the "default" namespace : 'E' (for entry)
Article will loose those methods:
Reader
The reader api has several flaws.
The first one is about the articles.
The
File
methods to get an article always return anArticle
instance.Whatever the article exists or not, or if it is a redirect.
User has to call
good()
method to check if the article has been found andisRedirect
to check if it is a redirect.This is error prone as user will forget a check at a moment.
On top of that,
Article
are light weight objects. While it is a nice feature (We can copy the article),it also means that we cannot store things in the article (as url, title, namespace or blob)
The
File
object will have several new methods:string getMetadata(string name) throw NotFound
to get the content associated to the metadata namedname
.If the metadata is not found, an exception is raised.
Entry getHandleBy*() throw NotFound
to get a handle on an entry (article or redirect) base on the url or title.If the handle is not found, an exception is raised.
The handle itself will not contain a lot but will have methods:
bool isRedirect()
to know if the handle is a redirect or not.string getPath()
andstring getTitle()
as both article and redirect have a path and a title.shared_ptr<Entry> getEntry(bool follow) throw IsRedirect
to get an entry if the handle is not a redirect.If the handle is a redirect and follow is false, throw an exception.
If the handle is a redirect and follow is true, identical to
getRedirect
.shared_ptr<Entry> getRedirect() throw NotFound
to get the entry to which to redirect point.If the redirect point to a non existing article, NotFound is raised.
So, it would be impossible to get an instance of a invalid
Handle
orEntry
.If the user get a
Entry
, it can use it without further check.Removed methods
Article File::getArticle(char ns, const std::string& url) const
Article File::getArticleByTitle(char ns, const std::string& title) const
article_index_type File::getNamespaceBeginOffset(char ch) const
article_index_type File::getNamespaceEndOffset(char ch) const
article_index_type File::getNamespaceCount(char ns) const
std::string File::getNamespaces() const
bool File::hasNamespace(char ch) const
const_iterator File::findByTitle(char ns, const std::string& title) const
const_iterator File::find(char ns, const std::string& url) const
char Article::getNamespace() const
It worth to mention than all:
Article
will be renamed toEntry
: Image, js or css are not articles.Url
will be renamed toPath
.Compatibility
Backward compatibility (New lib reading old zim)
Old zim file can be detected because of the existence of
A
namespaceThe path exposed to the user (as article's path) will contains the path (previously longUrl)
This is needed for tools like zimdump who need to write the article in correct subdirectories to preserve relative links.
When accessing a article from a path :
A
namespaceI
)C
namespace.Forward compatibility (Old lib reading new zim)
The namespace of the article will be presented to the "user".
If the application do not give any meaning to the namespace, there should be no problem.
Links stored in the article are relative and will stay valid.
User will see the
C
namespace instead ofA
/I
namespace.The main issue would be about bookmark when user update zim file without updating software.
The software would not be able to search the article in
C
namespace instead ofA
.Small Writer Changes
While it will not be used internally for now, I plan to add some new method on the (writing) entry to have some more useful information :
getEntryHint
, that will return a hint to the creator about the entry, for example but not limited to :Then we could regroup those entries in the same cluster has they will probably be used together)
Same here, we could regroup this entries in the same (uncompressed) cluster to have nothing to decompress to display the main page.
getIndexingContent
. For now, we are indexing the same content as the displayed content (minus the html tag).This is not always efficient. (We may not indexing the sources/reference/external links of wikipedia article. Or not the "related" questions in stackoverflow...)
But we cannot do this on libzim side. The "selection" of the text must be done on the scrapper who know the context.
This would potentially allow use to index other content than html (such as image or video) as we dissociate the content of what is indexed.
It would be possible that
shouldCompress
andshouldIndex
will be removed in favor ofgetEntryHint
andgetIndexingContent
Category handling
With the change on the namespace there is important change to mention : The namespaces have a meaning internal to the libzim.
Before that, namespace already have meaning but it was done at user code level (kiwix-lib).
So we can add specific feature inside libzim itself.
The first one is category handling.
Categories would be stored in the
C
namespace. The path of the category would represent its name and the content would be a binary content listing the index of the article in the category.It is possible to add extra parameter to dirent in libzim. We don't use it for now but it is time now.
The entry's dirent would have an extra parameter the index of its category.
The (writing) entry would have one more method to implement :
getCategory()
returning the name (the path) of the category of the article.The (reading) file would have new methods to get the list of the category, or a iterator to the entries in the category.
The entry would have a new method
getCategory
that return the category name of the entry.[Question] Should we allow category on redirect ?
[Question] Should we allow an entry to be in several categories ?
Entry template
The same way we store categories in
C
namespace we can store templates inT
namespace.Each entry would store the index of its layout as extra parameter (as for category).
Templates would be mustache templates and could access to the entry's information (path, title, content, category and extradata) and the zim's metadata (name, "host", ...)
The method
getContent
of the entry would return the rendered content. (If no template is set do asgetRawContent
)The method
getRawContent
would return the content stored in the entry without doing template rendering.Entry extra data
While it would be possible to add extra information on a entry using extra parameter, I prefer to keep the extra parameter to things pretty short and "technical".
It would be also possible to use another category to store the extra data for an entry but it would double the number of dirent (and copy the path).
The solution I propose is to store the extra data in a blob and, as extra parameter, add the cluster/blob number in the direct for the extradata.
So, it will be two blobs per entry (at least entries with extradata).
The extradata itself would use the MessagePack format (https://msgpack.org/) and the top level object would be a map.
The (writing) entry will have a new method
getExtraDatas
returning a std::map for the extradata.The (reading) entry will have a new method
getExtraDataAs*(std::string name)
to get the value of a particular extradata.[Question] Should we allow extradata on redirect ?
Conclusion
This is a huge changes.
However, the namespace change is mostly an API change without real internal code change.
But it clearly introduce API break and it should be discuss with users of libzim.
It somehow introduce an API break if some reader/implementation assume that article are always in
A
namespace.Project using libzim should be adapted (kiwix-lib, zim-tools, zimwriterfs, node and python wrappers)
All other improvement are a bit more complex to do but can be done separately later. They are based on the new way to handle namespaces and I think will should agree on them, at least the outlines.
Please give feedbacks for this proposition.
The text was updated successfully, but these errors were encountered: