Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plan to libzim2 #325

Closed
mgautierfr opened this issue Apr 29, 2020 · 12 comments
Closed

Plan to libzim2 #325

mgautierfr opened this issue Apr 29, 2020 · 12 comments

Comments

@mgautierfr
Copy link
Collaborator

mgautierfr commented Apr 29, 2020

Yes, the title is a lie.
libzim is already at version 6, we won't rename libzim to libzim2 and zim format may not change.
But we are discussing important changes in the libzim api and how we interpret the zim format.


This plan comes from several needs :

The plan is to pave the way for this changes.
Not everything will be made at once, but we should discuss this in the same time.

Removing namespace

Totally removing the namespace from the zim format is a pretty complex things,
it means a total zim file format change, and a rewriting a a big part of the library.
On top of that, removing them leads to new questions (How to store the metadata ?)

So the idea is to keep the namespaces but use them differently :

  • Make the namespace totally internal in the implementation.
  • Change the API hide the namespace and provide higher level function.

Creator

The creator will have different methods :

  • addMetadata (instead of adding article in 'M' namespace)
  • addRedirect (instead of adding a redirect article)
  • addEntry (instead of adding "classical" article)

It will simplify the user code as it doesn't need to create an article for
metadata and redirect.
The "classical" article will not need to have getNamespace method and the
path will not contains it.

The namespaces will still be used :

'M' and 'X' for metadata and indexes (as before)

All other content will go in the "default" namespace : 'E' (for entry)

Article will loose those methods:

  • isRedirect, getRedirectUrl => no more used, we have the addRedirect method
  • isLinkTarget => Never used
  • isDeleted => Never used

Reader

The reader api has several flaws.
The first one is about the articles.

The File methods to get an article always return an Article instance.
Whatever the article exists or not, or if it is a redirect.
User has to call good() method to check if the article has been found and isRedirect to check if it is a redirect.
This is error prone as user will forget a check at a moment.

On top of that, Article are light weight objects. While it is a nice feature (We can copy the article),
it also means that we cannot store things in the article (as url, title, namespace or blob)

The File object will have several new methods:

  • string getMetadata(string name) throw NotFound to get the content associated to the metadata named name.
    If the metadata is not found, an exception is raised.
  • Entry getHandleBy*() throw NotFound to get a handle on an entry (article or redirect) base on the url or title.
    If the handle is not found, an exception is raised.

The handle itself will not contain a lot but will have methods:

  • bool isRedirect() to know if the handle is a redirect or not.
  • string getPath() and string getTitle() as both article and redirect have a path and a title.
  • shared_ptr<Entry> getEntry(bool follow) throw IsRedirect to get an entry if the handle is not a redirect.
    If the handle is a redirect and follow is false, throw an exception.
    If the handle is a redirect and follow is true, identical to getRedirect.
  • shared_ptr<Entry> getRedirect() throw NotFound to get the entry to which to redirect point.
    If the redirect point to a non existing article, NotFound is raised.

So, it would be impossible to get an instance of a invalid Handle or Entry.
If the user get a Entry, it can use it without further check.

Removed methods

  • Article File::getArticle(char ns, const std::string& url) const
  • Article File::getArticleByTitle(char ns, const std::string& title) const
  • article_index_type File::getNamespaceBeginOffset(char ch) const
  • article_index_type File::getNamespaceEndOffset(char ch) const
  • article_index_type File::getNamespaceCount(char ns) const
  • std::string File::getNamespaces() const
  • bool File::hasNamespace(char ch) const
  • const_iterator File::findByTitle(char ns, const std::string& title) const
  • const_iterator File::find(char ns, const std::string& url) const
  • char Article::getNamespace() const

It worth to mention than all:

  • Article will be renamed to Entry : Image, js or css are not articles.
  • Url will be renamed to Path.

Compatibility

Backward compatibility (New lib reading old zim)

Old zim file can be detected because of the existence of A namespace

The path exposed to the user (as article's path) will contains the path (previously longUrl)
This is needed for tools like zimdump who need to write the article in correct subdirectories to preserve relative links.

When accessing a article from a path :

  • Path without namespace on old zim => search article in A namespace
  • Path without namespace on new zim => No change
  • Path with namespace on old zim => use the namespace in the path (to be able to access image in I)
  • Path with namespace on new zim (old bookmark) => remove namespace and search in C namespace.

Forward compatibility (Old lib reading new zim)

The namespace of the article will be presented to the "user".
If the application do not give any meaning to the namespace, there should be no problem.
Links stored in the article are relative and will stay valid.
User will see the C namespace instead of A/I namespace.
The main issue would be about bookmark when user update zim file without updating software.
The software would not be able to search the article in C namespace instead of A.

Small Writer Changes

While it will not be used internally for now, I plan to add some new method on the (writing) entry to have some more useful information :

  • getEntryHint, that will return a hint to the creator about the entry, for example but not limited to :
    • If it is a main article
    • If the article is a "chrome" entry (css, js, ...).
      Then we could regroup those entries in the same cluster has they will probably be used together)
    • If the entry is about the first page (html but also css, js).
      Same here, we could regroup this entries in the same (uncompressed) cluster to have nothing to decompress to display the main page.
  • getIndexingContent. For now, we are indexing the same content as the displayed content (minus the html tag).
    This is not always efficient. (We may not indexing the sources/reference/external links of wikipedia article. Or not the "related" questions in stackoverflow...)
    But we cannot do this on libzim side. The "selection" of the text must be done on the scrapper who know the context.
    This would potentially allow use to index other content than html (such as image or video) as we dissociate the content of what is indexed.

It would be possible that shouldCompress and shouldIndex will be removed in favor of getEntryHint and getIndexingContent

Category handling

With the change on the namespace there is important change to mention : The namespaces have a meaning internal to the libzim.
Before that, namespace already have meaning but it was done at user code level (kiwix-lib).
So we can add specific feature inside libzim itself.

The first one is category handling.

Categories would be stored in the C namespace. The path of the category would represent its name and the content would be a binary content listing the index of the article in the category.
It is possible to add extra parameter to dirent in libzim. We don't use it for now but it is time now.
The entry's dirent would have an extra parameter the index of its category.

The (writing) entry would have one more method to implement : getCategory() returning the name (the path) of the category of the article.
The (reading) file would have new methods to get the list of the category, or a iterator to the entries in the category.

The entry would have a new method getCategory that return the category name of the entry.

[Question] Should we allow category on redirect ?
[Question] Should we allow an entry to be in several categories ?

Entry template

The same way we store categories in Cnamespace we can store templates in T namespace.
Each entry would store the index of its layout as extra parameter (as for category).

Templates would be mustache templates and could access to the entry's information (path, title, content, category and extradata) and the zim's metadata (name, "host", ...)

The method getContent of the entry would return the rendered content. (If no template is set do as getRawContent)
The method getRawContent would return the content stored in the entry without doing template rendering.

Entry extra data

While it would be possible to add extra information on a entry using extra parameter, I prefer to keep the extra parameter to things pretty short and "technical".
It would be also possible to use another category to store the extra data for an entry but it would double the number of dirent (and copy the path).

The solution I propose is to store the extra data in a blob and, as extra parameter, add the cluster/blob number in the direct for the extradata.
So, it will be two blobs per entry (at least entries with extradata).

The extradata itself would use the MessagePack format (https://msgpack.org/) and the top level object would be a map.

The (writing) entry will have a new method getExtraDatas returning a std::map for the extradata.
The (reading) entry will have a new method getExtraDataAs*(std::string name) to get the value of a particular extradata.

[Question] Should we allow extradata on redirect ?

Conclusion

This is a huge changes.
However, the namespace change is mostly an API change without real internal code change.
But it clearly introduce API break and it should be discuss with users of libzim.
It somehow introduce an API break if some reader/implementation assume that article are always in A namespace.
Project using libzim should be adapted (kiwix-lib, zim-tools, zimwriterfs, node and python wrappers)

All other improvement are a bit more complex to do but can be done separately later. They are based on the new way to handle namespaces and I think will should agree on them, at least the outlines.

Please give feedbacks for this proposition.

@mgautierfr
Copy link
Collaborator Author

mgautierfr commented Apr 29, 2020

ping @kelson42, @julianharty, @veloman-yunkan, @jetownfeve21, @mossroy, @Jaifroid, @dnohales, @automactic, @MiguelRocha, @rgaudin

Please add any other that may be interested.

@data-man
Copy link
Contributor

My personal wishes: :)

  • switch to C++17.
  • use std::string_view (or analogues) for avoiding unnecessary string allocations
  • use std::filesystem (or analogues)
  • use third-party containers (maps and maybe strings)

@rgaudin
Copy link
Member

rgaudin commented Apr 30, 2020

Thanks for the detailed overview of the plan.

Can you elaborate on the use cases you see for the Entry categories? Trying to understand if it's targeting internal/technical use (wp_article, images, styles) with restrictions or if it's aimed at ZIM end-users (Places, Restaurants).

@mgautierfr
Copy link
Collaborator Author

It could be both

I was thinking about end-users category (as I understand #75 and #317), but it could also be used to have a list of "main" articles (opposed to images and styles)

@mossroy
Copy link

mossroy commented May 2, 2020

Thanks for sharing the plan with us.
kiwix-js is currently not using libzim and kiwix-lib. It relies on a custom javascript implementation (and does not support all features like full-text search).
While some of these changes might be implemented in our code, I think the only sustainable way to follow these changes (and some other ones like using other compression formats) is to leave our custom implementation, and use libzim and kiwix-lib instead (by compiling them with emscripten).
Some work has already been done on that, and it's probably possible to achieve it, but it's still a work-in-progress for now. Because of lack of time on my side, and also lack of skills on C/C++ (and their build ecosystem).
We definitely need some help to finalize this transition, else kiwix-js won't be able to catch up with newer ZIM files. See kiwix/kiwix-js#509, kiwix/kiwix-js#510, kiwix/kiwix-js#511, kiwix/kiwix-js#512, kiwix/kiwix-js#513 (which should be fixed with newer versions of emscripten), and kiwix/kiwix-js#515. In order to finally implement kiwix/kiwix-js#514

Additionally, I had to patch a little bit libzim and kiwix-lib to be able to compile them with emscripten. I did that in a very ugly way. Ideally, these patches would be applied upstream (in a cleaner way). And compiling with emscripten would be integrated in the regular build process. See kiwix/kiwix-js#508

@Jaifroid
Copy link

Jaifroid commented May 3, 2020

I agree with @mossroy about the way forward for Kiwix JS, and the need for help from someone knowledgeable in compiling C++ (ideally with Emscripten). The block at our end is not for want of trying.

More generally, however, I think it would be very important to document fully the proposed ZIM file format specification before beginning work on these breaking changes. Some things that at least extend the spec have been added without documenting them, most notably the search index encoded in a format that isn't mentioned on openZIM (and that breaks if the ZIM is split...).

Please note that we currently rely, for search, on the title pointer list (we cannot read the compressed search index). I don't see anything above that threatens that. If that is correct, then it may be possible to adapt our reader if we are still unable to compile libzim to wasm, but compiling libzim is by far preferable.

@mgautierfr
Copy link
Collaborator Author

My personal wishes: :) [……] C++17/std::string_view/std::filesystem

They are good wishes :) But a bit out of scope for now. (And I don't understand the third-party containers point)

need for help from someone knowledgeable in compiling C++ (ideally with Emscripten)

Maybe we can help (depending of our priorities). While it is probably doable for libzim itself, I am a bit affraid of the dependencies compilation (icu4c, xapian, ...)

More generally, however, I think it would be very important to document fully the proposed ZIM file format specification before beginning work on these breaking changes

Yes, and it will be. But there is not a lot of change about the format for namespace, categories and extradata. None of them is breaking the format. If you can read a current zim file, you should be able to read a new one (without the new features). For the templating system, yes you will have to adapt as you need to render the article.

most notably the search index encoded in a format that isn't mentioned on openZIM (and that breaks if the ZIM is split...).

The search index is a xapian database. It was already the case before but it was in the .idx directory, distributed alongside of the zim file. Now (since 3 years) the database is embedded in the zim file as any other file in the archive (but in X namespace). Zim library doesn't read it directly (and shouldn't). We use the xapian library to read it. It breaks if the zim is split because xapian lib cannot read split index.

If something is missing in the spec, please open a specific issue.

@data-man
Copy link
Contributor

data-man commented May 4, 2020

@mgautierfr

And I don't understand the third-party containers point

I mean various replacements for std::map.
Personally I prefer Tessil's libraries.
E.g. hat-trie is very compacted and fast especially for strings.

@mossroy
Copy link

mossroy commented May 4, 2020

need for help from someone knowledgeable in compiling C++ (ideally with Emscripten)

Maybe we can help (depending of our priorities). While it is probably doable for libzim itself, I am a bit affraid of the dependencies compilation (icu4c, xapian, ...)

I managed to compile both libzim and kiwix-lib with emscripten (see Makefile of https://github.com/mossroy/libzim_wasm/), but in a quick-and-dirty way. It needed to compile these dependencies too.

@kelson42
Copy link
Contributor

Old zim file can be detected because of the existence of A namespace

I would really prefer to get things based on a minor version udpate of the ZIM format

[Question] Should we allow an entry to be in several categories ?

I think yes. I have a bit difficulties how this looks like, in particular how we display things? The label of the category? The pagination of categories having 100.000 entries? The layout?

The solution I propose is to store the extra data in a blob and, as extra parameter, add the cluster/blob number in the direct for the extradata.
So, it will be two blobs per entry (at least entries with extradata).

I don't really get that part. You mean store it in the dirent? Would that be compressed?

@mgautierfr
Copy link
Collaborator Author

I would really prefer to get things based on a minor version udpate of the ZIM format

Indeed

I don't really get that part. You mean store it in the dirent? Would that be compressed?

The dirent can store extra parameter (called parameter in the spec). We don't use it for now.
The extra data would be stored in a blob in a cluster (it could be compressed or not depending of the cluster[1]). The blob address (cluster number and blob number in that cluster) would be stored in the dirent as a parameter.

I think yes. I have a bit difficulties how this looks like, in particular how we display things? The label of the category? The pagination of categories having 100.000 entries? The layout?

The category's content will be a binary content stored in a blob. It will be a list of article indexes.
This content is meant to be read by libzim itself and presented to the user code through a specific API (iterator, ...).
It would be up to the client to use this information as it want (kiwix-serve may generate a html view, android or ios client may use a specific view, ...)

The question (several categories per article) is more about to find in which categories is an article.
If we have only one category, we can add the category index in the dirent (using parameters). But if there are several categories, we need something a bit more complex. We could add the list of categories in the extradata but then we mix a user content (user extradata) and a zim internal information.


[1] In which cluster we are storing the extradata is an open question. If extradata are use to get information about an article without always acceding the article (listing, search, ...) it is better to use a specific cluster (potentially not compressed) for the extradata. If they are use in the same time as the article, it is better to store them in the same cluster of the article. I'm not sure we can decide that on the libzim side.

@kelson42
Copy link
Contributor

Implementation has started

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants