
Request only parts of PDF document #1108

Closed
jviereck opened this issue Jan 21, 2012 · 5 comments

Comments

@jviereck
Contributor

This is a short summary of some thoughts I've had on how to avoid downloading the entire PDF file and instead fetch only the parts necessary for rendering. This is very important if the PDF document is very large: downloading it takes a while, but the user might only view a very small subset of the pages.

Some PDFs have the notion of a 'linearization' "header", which (if present) sits in the first 1024 bytes of the document and contains information about the total document size plus where certain structures are located in the PDF file. To make things easier, let's concentrate on the case where the PDF is linearized.
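To make the idea concrete, here is a rough sketch of what detecting that header could look like. This is a hypothetical illustration, not how a real parser should work (a real one would tokenize the first object instead of using regexes); the function name and the scan-for-keys approach are my own assumptions:

```javascript
// Hypothetical sketch: look for a linearization dictionary in the first
// 1024 bytes of a PDF. Real parsing would lex the first indirect object;
// here we only scan for the /Linearized key plus the /L (total file
// length) and /T (offset of the first xref) entries.
function parseLinearizationHeader(firstBytes) {
  const text = String.fromCharCode.apply(null, firstBytes.subarray(0, 1024));
  if (text.indexOf('/Linearized') === -1) {
    return null; // not linearized: fall back to downloading the whole file
  }
  const lengthMatch = /\/L\s+(\d+)/.exec(text);
  const xrefMatch = /\/T\s+(\d+)/.exec(text);
  return {
    fileLength: lengthMatch ? parseInt(lengthMatch[1], 10) : null,
    firstXrefOffset: xrefMatch ? parseInt(xrefMatch[1], 10) : null,
  };
}
```

If this returns `null`, we would simply keep the current behaviour and download everything.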

If the PDF document is linearized, the data required to render the first page is grouped at the beginning of the file, so fetching that part of the document is easy. However, there is no such information for the other pages. In Chrome, if you view a linearized PDF you will see the first page displayed while the rest of the PDF is still being downloaded. Of course this is better than what PDF.JS can do right now, but I'd like a solution that works equally for all pages.

Here is my basic idea: when we fetch a PDF, we check if it's linearized. If it isn't, we just do what we do right now (download the entire PDF); otherwise we download only the ranges that contain the XRef information (the linearization header tells us where that XRef data is located). Once we have the XRef data, we know which parts of the document we have to fetch to extract a given object.
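Downloading "only the ranges" means HTTP Range requests. A minimal sketch, assuming the server answers with `206 Partial Content` (the function name and callback shape are mine, and the small header-building helper is split out just for clarity):

```javascript
// Build the HTTP Range header value for a half-open byte range
// [begin, end). HTTP byte ranges are inclusive on both ends, so the
// exclusive end has to be decremented by one.
function rangeHeaderValue(begin, end) {
  return 'bytes=' + begin + '-' + (end - 1);
}

// Hypothetical sketch: fetch only a byte range of the PDF. If the server
// ignores the Range header (status 200 instead of 206), we should fall
// back to downloading the whole file.
function fetchRange(url, begin, end, callback) {
  const xhr = new XMLHttpRequest();
  xhr.open('GET', url, true);
  xhr.responseType = 'arraybuffer';
  xhr.setRequestHeader('Range', rangeHeaderValue(begin, end));
  xhr.onload = function () {
    if (xhr.status === 206) {
      callback(null, new Uint8Array(xhr.response));
    } else {
      callback(new Error('server does not support range requests'));
    }
  };
  xhr.send(null);
}
```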

In the current code, to get an object we call XRef.fetch (or .fetchIfRef), which creates a new substream of the main PDF stream, does some processing, and returns the finished object. In a world where we only load chunks of the document, the parts of the document required to fetch a certain object might not be available yet.

The good thing is that the XRef table tells us exactly where an object begins and where it ends. That means whenever we try to fetch an object that hasn't been downloaded yet, we know exactly which ranges to request from the server.
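Strictly speaking, the xref table only records where each object *starts*, so the end of an object has to be bounded by the offset of the nearest following object (or the end of the file). A sketch of that range computation, with an assumed offsets-array representation:

```javascript
// Hypothetical sketch: xrefOffsets[objNum] is the byte offset where
// object objNum starts. The end of an object is approximated by the
// smallest offset greater than its own, since the xref table only
// records start positions.
function objectRange(xrefOffsets, objNum, fileLength) {
  const begin = xrefOffsets[objNum];
  let end = fileLength;
  for (let i = 0; i < xrefOffsets.length; i++) {
    const off = xrefOffsets[i];
    if (off > begin && off < end) {
      end = off; // the nearest following object bounds this one
    }
  }
  return { begin: begin, end: end };
}
```

The resulting `{begin, end}` pair is exactly what we would hand to a range request.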

Since we can't be sure all the data needed to fetch an object is available, we would normally have to use promises whenever we fetch a new PDF object. This would result in very-nasty-to-debug call chains and I doubt it would be maintainable (not to speak of the work required to go over the entire codebase and make it async…). To solve this, one could add "continuation" areas. Imagine the getIRQueue function is a continuation area. Once we enter the function, we can store the parser, resources etc. If we detect that an object is not available in any function called from getIRQueue (imagine some color space is missing), the continuation area is notified, and once the missing data/object arrives, it knows where to continue. Note that this doesn't mean we continue directly at the line where the color space was missing, but at a point much higher in the call stack. That's slower (as we have to redo some of the computation), but it makes life way easier, I guess. To notify a continuation area that an object is missing, the XRef.fetch function could just throw a MissingDataError; the continuation area uses a try-catch to watch for this kind of error and knows how to deal with it.
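The throw/catch/restart cycle could look roughly like this. Everything here is an assumed shape for illustration (MissingDataError carrying the needed byte range, and a generic retry driver standing in for a real continuation area like getIRQueue):

```javascript
// Hypothetical sketch of the "continuation area" idea: XRef.fetch throws
// a MissingDataError when the bytes backing an object have not arrived
// yet; the continuation area catches it, requests the range, and re-runs
// the whole computation from the top once the data is in.
function MissingDataError(begin, end) {
  this.name = 'MissingDataError';
  this.begin = begin; // byte range that still needs to be downloaded
  this.end = end;
}
MissingDataError.prototype = Object.create(Error.prototype);

// `compute` is retried until it stops hitting missing data. `loadRange`
// is expected to fetch the bytes and then invoke its callback.
function runWithContinuation(compute, loadRange, done) {
  try {
    done(null, compute());
  } catch (e) {
    if (e instanceof MissingDataError) {
      loadRange(e.begin, e.end, function () {
        // Restart from the top of the continuation area: slower (some
        // work is redone) but far simpler than resuming mid-call-stack.
        runWithContinuation(compute, loadRange, done);
      });
    } else {
      throw e; // unrelated errors propagate as usual
    }
  }
}
```

This is exactly the trade-off described above: redoing a bit of computation in exchange for keeping the bulk of the codebase synchronous.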

Instead of fetching only one object at a time, it might be better to fetch a larger chunk of the PDF (100kb, say?) and get some of the following objects as well, not just the one object that is missing. My hope is that by fetching a larger chunk of the missing PDF data, fewer requests are needed, so getting a missing page might only require 3-4 extra range requests against the server (requests don't come for free, as there is network latency).
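One cheap way to get that batching is to round every requested range out to fixed-size chunk boundaries, so neighbouring objects ride along in the same request. A sketch (the chunk size is an assumption, not a measured number):

```javascript
// Hypothetical sketch: widen a requested byte range to chunk boundaries
// so that objects near the missing one are downloaded in the same
// request and later fetches are often already satisfied.
const CHUNK_SIZE = 100 * 1024; // assumed chunk size, per the "100kb?" guess

function alignedRange(begin, end, fileLength) {
  const alignedBegin = Math.floor(begin / CHUNK_SIZE) * CHUNK_SIZE;
  const alignedEnd = Math.min(
    Math.ceil(end / CHUNK_SIZE) * CHUNK_SIZE,
    fileLength
  );
  return { begin: alignedBegin, end: alignedEnd };
}
```

A nice side effect of chunk alignment is that "which bytes do we have?" becomes a simple per-chunk bitmap instead of arbitrary interval bookkeeping.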

The easiest way to analyze whether implementing something like this is useful at all might be to record which objects are required to render a page and calculate how many requests are necessary to get all of them. If it turns out only a small number of requests is required, this feature might be worth implementing; but if it turns out there are a lot of requests and the network latency makes it not worth avoiding the full download, we should reconsider going for this feature. There is also the question of how many servers support range requests at all.

PS: This is only a small summary of what I've thought through already; I'm not sure how much of it is worth writing down right now. This should give more of a very basic overview of what I was thinking about.

PPS: My exams start on Monday, so I can't pay too much attention to this issue, but I wanted to dump what's in my brain so far :/

@wemakeweb

+1 for chunked download. What's the status of that?

@andrewseddon

+1

1 similar comment
@podviaznikov

+1

@mduan
Contributor

mduan commented Feb 11, 2013

Just a heads up that I am currently doing some work on this. I have an implementation that works right now, but it's rather slow so far. There are a lot of optimizations I'm currently working on by adding more "continuation" areas.

@brendandahl
Contributor

Closing since we landed the above PR.
