
Request only parts of PDF document #1108

Closed
jviereck opened this issue Jan 21, 2012 · 5 comments

Comments

@jviereck
Contributor

This is a short summary of some thoughts I've had on how to avoid downloading the entire PDF file and instead fetch only the parts necessary for rendering. This is very important if the PDF document is very large: downloading it takes a while, but the user might only view a very small subset of the pages.

Some PDFs have the notion of a 'linearization' "header", which (if present) sits in the first 1024 bytes of the document and contains information about the total document size plus where certain structures are located in the PDF file. To make things easier, let's concentrate on the case where the PDF is linearized.
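To make the idea concrete, here is a rough sketch of what detecting that header could look like. This is a hypothetical illustration, not how a real parser should work (a real one would tokenize the first object instead of using regexes); the function name and the scan-for-keys approach are my own assumptions:

```javascript
// Hypothetical sketch: look for a linearization dictionary in the first
// 1024 bytes of a PDF. Real parsing would lex the first indirect object;
// here we only scan for the /Linearized key plus the /L (total file
// length) and /T (offset of the first xref) entries.
function parseLinearizationHeader(firstBytes) {
  const text = String.fromCharCode.apply(null, firstBytes.subarray(0, 1024));
  if (text.indexOf('/Linearized') === -1) {
    return null; // not linearized: fall back to downloading the whole file
  }
  const lengthMatch = /\/L\s+(\d+)/.exec(text);
  const xrefMatch = /\/T\s+(\d+)/.exec(text);
  return {
    fileLength: lengthMatch ? parseInt(lengthMatch[1], 10) : null,
    firstXrefOffset: xrefMatch ? parseInt(xrefMatch[1], 10) : null,
  };
}
```

If this returns `null`, we would simply keep the current behaviour and download everything.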

If the PDF document is linearized, the data required to render the first page is grouped at the beginning of the file, so fetching that part of the document is easy. However, there is no such information for the other pages. In Chrome, if you view a linearized PDF you will see the first page displayed while the rest of the PDF is still being downloaded. Of course this is better than what PDF.JS can do right now, but I'd like a solution that works equally for all pages.

Here is my basic idea: when we fetch a PDF, we check if it's linearized. If it isn't, we just do what we do right now (download the entire PDF); otherwise we download only the ranges that contain the XRef information (the linearization header tells us where that XRef data is located). Once we have the XRef data, we know which parts of the document we have to fetch to extract a given object.
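Downloading "only the ranges" means HTTP Range requests. A minimal sketch, assuming the server answers with `206 Partial Content` (the function name and callback shape are mine, and the small header-building helper is split out just for clarity):

```javascript
// Build the HTTP Range header value for a half-open byte range
// [begin, end). HTTP byte ranges are inclusive on both ends, so the
// exclusive end has to be decremented by one.
function rangeHeaderValue(begin, end) {
  return 'bytes=' + begin + '-' + (end - 1);
}

// Hypothetical sketch: fetch only a byte range of the PDF. If the server
// ignores the Range header (status 200 instead of 206), we should fall
// back to downloading the whole file.
function fetchRange(url, begin, end, callback) {
  const xhr = new XMLHttpRequest();
  xhr.open('GET', url, true);
  xhr.responseType = 'arraybuffer';
  xhr.setRequestHeader('Range', rangeHeaderValue(begin, end));
  xhr.onload = function () {
    if (xhr.status === 206) {
      callback(null, new Uint8Array(xhr.response));
    } else {
      callback(new Error('server does not support range requests'));
    }
  };
  xhr.send(null);
}
```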

In the current code, to get an object we call XRef.fetch (or .fetchIfRef), which creates a new substream of the main PDF stream, does some processing, and returns the finished object. In a world where we only load chunks of the document, the parts of the document required to fetch a certain object might not be available yet.

The good thing is that the XRef table tells us exactly where an object begins and where it ends. That means whenever we try to fetch an object that hasn't been downloaded yet, we know exactly which ranges to request from the server.
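Strictly speaking, the xref table only records where each object *starts*, so the end of an object has to be bounded by the offset of the nearest following object (or the end of the file). A sketch of that range computation, with an assumed offsets-array representation:

```javascript
// Hypothetical sketch: xrefOffsets[objNum] is the byte offset where
// object objNum starts. The end of an object is approximated by the
// smallest offset greater than its own, since the xref table only
// records start positions.
function objectRange(xrefOffsets, objNum, fileLength) {
  const begin = xrefOffsets[objNum];
  let end = fileLength;
  for (let i = 0; i < xrefOffsets.length; i++) {
    const off = xrefOffsets[i];
    if (off > begin && off < end) {
      end = off; // the nearest following object bounds this one
    }
  }
  return { begin: begin, end: end };
}
```

The resulting `{begin, end}` pair is exactly what we would hand to a range request.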

Since we can't be sure all the data needed to fetch an object is available, we would normally have to use promises whenever we fetch a new PDF object. This would result in very-nasty-to-debug call chains and I doubt it would be maintainable (not to speak of the work required to go over the entire codebase and make it async…). To solve this, one could add "continuation" areas. Imagine the getIRQueue function is a continuation area. Once we enter the function, we can store the parser, resources etc. If we detect that an object is not available in any function called from getIRQueue (imagine some color space is missing), the continuation area is notified, and once the missing data/object arrives, it knows where to continue. Note that this doesn't mean we continue directly at the line where the color space was missing, but at a point much higher in the call stack. That's slower (as we have to redo some of the computation), but it makes life way easier, I guess. To notify a continuation area that an object is missing, the XRef.fetch function could just throw a MissingDataError; the continuation area uses a try-catch to watch for this kind of error and knows how to deal with it.
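The throw/catch/restart cycle could look roughly like this. Everything here is an assumed shape for illustration (MissingDataError carrying the needed byte range, and a generic retry driver standing in for a real continuation area like getIRQueue):

```javascript
// Hypothetical sketch of the "continuation area" idea: XRef.fetch throws
// a MissingDataError when the bytes backing an object have not arrived
// yet; the continuation area catches it, requests the range, and re-runs
// the whole computation from the top once the data is in.
function MissingDataError(begin, end) {
  this.name = 'MissingDataError';
  this.begin = begin; // byte range that still needs to be downloaded
  this.end = end;
}
MissingDataError.prototype = Object.create(Error.prototype);

// `compute` is retried until it stops hitting missing data. `loadRange`
// is expected to fetch the bytes and then invoke its callback.
function runWithContinuation(compute, loadRange, done) {
  try {
    done(null, compute());
  } catch (e) {
    if (e instanceof MissingDataError) {
      loadRange(e.begin, e.end, function () {
        // Restart from the top of the continuation area: slower (some
        // work is redone) but far simpler than resuming mid-call-stack.
        runWithContinuation(compute, loadRange, done);
      });
    } else {
      throw e; // unrelated errors propagate as usual
    }
  }
}
```

This is exactly the trade-off described above: redoing a bit of computation in exchange for keeping the bulk of the codebase synchronous.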

Instead of fetching only one object at a time, it might be better to fetch a larger chunk of the PDF (100kb, say?) and get some of the following objects as well, not just the one object that is missing. My hope is that by fetching a larger chunk of the missing PDF data, fewer requests are needed, so getting a missing page might only require 3-4 extra range requests against the server (requests don't come for free, as there is network latency).
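One cheap way to get that batching is to round every requested range out to fixed-size chunk boundaries, so neighbouring objects ride along in the same request. A sketch (the chunk size is an assumption, not a measured number):

```javascript
// Hypothetical sketch: widen a requested byte range to chunk boundaries
// so that objects near the missing one are downloaded in the same
// request and later fetches are often already satisfied.
const CHUNK_SIZE = 100 * 1024; // assumed chunk size, per the "100kb?" guess

function alignedRange(begin, end, fileLength) {
  const alignedBegin = Math.floor(begin / CHUNK_SIZE) * CHUNK_SIZE;
  const alignedEnd = Math.min(
    Math.ceil(end / CHUNK_SIZE) * CHUNK_SIZE,
    fileLength
  );
  return { begin: alignedBegin, end: alignedEnd };
}
```

A nice side effect of chunk alignment is that "which bytes do we have?" becomes a simple per-chunk bitmap instead of arbitrary interval bookkeeping.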

The easiest way to analyze whether implementing something like this is useful at all might be to record which objects are required to render a page and calculate how many requests are necessary to get all of them. If it turns out only a small number of requests is required, this feature might be worth implementing; but if it turns out there are a lot of requests and the network latency makes it not worth avoiding the full download, we should reconsider going for this feature. There is also the question of how many servers support range requests at all.

PS: This is only a small summary of what I've thought through already; I'm not sure how much of it is worth writing down right now. This should give more of a very basic overview of what I was thinking about.

PPS: My exams start on Monday, so I can't pay too much attention to this issue, but I wanted to dump what's in my brain so far :/

@wemakeweb

+1 for chunked download. What's the status of that?

@andrewseddon

+1

1 similar comment
@podviaznikov

+1

@mduan
Contributor

mduan commented Feb 11, 2013

Just a heads up that I am currently doing some work on this. I have an implementation that works right now, but it's rather slow so far. There are a lot of optimizations I'm currently working on by adding more "continuation" areas.

@brendandahl
Contributor

Closing since we landed the above PR.
