HTML writer - support new table features #6314

Closed
jgm opened this issue Apr 24, 2020 · 23 comments

@jgm
Owner

jgm commented Apr 24, 2020

Add support for new table features introduced in
jgm/pandoc-types#66
including table attributes (including identifier), rowspan, colspan, table head and foot, multiple header lines, row headers, captions that allow block-level content and include an optional short caption.

@jgm jgm changed the title from "LaTeX writer - support new table features" to "HTML writer - support new table features" Apr 24, 2020
@jgm jgm mentioned this issue Apr 24, 2020
9 tasks
@tarleb
Collaborator

tarleb commented Aug 24, 2020

I'm starting to work on this.

@danlobo02

subscribing

@danlobo02

danlobo02 commented Aug 24, 2020

Is this related or is there a ticket for html reader to support jgm/pandoc-types#66 ?

@tarleb
Collaborator

tarleb commented Aug 24, 2020

The ticket for the HTML reader is #6312. It has a "Good first issue" label, but I'm not sure if that really applies.

@tarleb
Collaborator

tarleb commented Aug 25, 2020

Progress notes: The hardest problem I'm facing is that, in order to apply the correct alignment to each cell and the correct row number to each row, we need to have a good idea of how the table grid will look. This has become much more difficult with the new table structure. My current line of attack is to build a separate T.P.Writers.Tables module that creates a rectangular GridTable. This should allow reusing large parts of the current table writer functions while making it easy to find a cell's column and row number.

{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Text.Pandoc.Definition (Cell, Row)

-- | Table row offset (i.e., the number of rows which have to be
-- moved up to find the topmost row belonging to a cell).
newtype RowOffset = RowOffset Int
  deriving (Eq, Num, Enum)

-- | Table column offset (i.e., the number of columns which have to
-- be moved left to find the leftmost column belonging to a cell).
newtype ColOffset = ColOffset Int
  deriving (Eq, Num, Enum)

-- | Rectangular table which makes it easy to match table cells
-- with their column and row numbers.
newtype GridTable = GridTable [GridRow]

-- | Single row of a 'GridTable'.
newtype GridRow = GridRow [GridCell]

-- | Single cell of a 'GridTable'. Usually, only cells with zero
-- offsets should be rendered. Other cells serve as placeholders.
data GridCell = GridCell RowOffset ColOffset Cell

toGridTable :: [Row] -> GridTable
toGridTable = undefined -- WIP

This should be useful for other writers as well.

I'd love to learn about alternative approaches and ideas.
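For discussion, here is a minimal sketch of how such a grid could be built. It is not the WIP implementation: `SimpleCell`, `Slot`, and `toGrid` are hypothetical names, the cell type is reduced to a label plus its spans, and the grid width is passed in explicitly. Cells that do not fit are clipped or dropped, which is an assumption about the intended behaviour.

```haskell
import Data.Maybe (isNothing)

-- Hypothetical simplified cell: a label, a row span, and a column span.
data SimpleCell = SimpleCell
  { cellLabel :: String
  , cellRows  :: Int
  , cellCols  :: Int
  } deriving (Eq, Show)

-- | Grid slot: row offset and column offset back to the cell's
-- top-left corner, plus the cell covering the slot. Only slots with
-- both offsets zero would be rendered; the rest are placeholders.
data Slot = Slot Int Int SimpleCell
  deriving (Eq, Show)

-- | Lay the rows out on a grid of the given width. A 'Nothing'
-- marks a slot that no cell covers.
toGrid :: Int -> [[SimpleCell]] -> [[Maybe Slot]]
toGrid width = go (replicate width Nothing)
  where
    go _ [] = []
    go above (r : rs) =
      let row = fillRow (map inherit above) r
      in row : go row rs

    -- A slot carries over to the next row while its cell spans further down.
    inherit (Just (Slot ro co c))
      | ro + 1 < cellRows c = Just (Slot (ro + 1) co c)
    inherit _ = Nothing

    -- Walk the row left to right, putting cells into free slots.
    fillRow [] _ = []                       -- row is full: drop leftover cells
    fillRow (Just s : slots) cells = Just s : fillRow slots cells
    fillRow slots@(Nothing : _) [] = slots  -- no cells left
    fillRow slots@(Nothing : _) (c : cells) =
      let free = length (takeWhile isNothing slots)
          w    = max 1 (min (cellCols c) free)  -- clip to the free space
          new  = [Just (Slot 0 co c) | co <- [0 .. w - 1]]
      in new ++ fillRow (drop w slots) cells
```

Each grid row then has exactly `width` slots, so a cell's column number is simply its index in the row.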

@jgm
Owner Author

jgm commented Aug 25, 2020

It would be great to get feedback from @despresc on this.

@kysko

kysko commented Aug 26, 2020

I wrote an experimental HTML table writer as a Lua filter over the past few weeks, just for the heck of it, knowing it would be useless once there was an official writer. So I didn't publish it, but maybe it could give some ideas now. (Sorry if it's a bit messy.)
(I included some native-format files that I used for testing.)

I ran into a similar problem: finding a cell's "true column". It's not in Haskell, but maybe it could help.

In particular, it's in the "Table Environment" function/pseudoclass col_env, more exactly the .tcols part, which is used in the function process_cell, in its "Columns State" section.

Basically, I keep track of which cells are occupied across rows when there are row spans: as I read each cell's info, I write its row span across its column span, and as I advance through the rows, a decrement at each row start indicates how much of the current cell occupancy remains; an occupancy of 0 means it is safe to write to that slot.

For example, at the beginning of processing a row (for a five-cell row), we could have:

1 1 2 0 1

which means that the first 3 cells and the fifth are still "occupied" by row-spanning cells from above, and only the 4th cell can be written to.
The cell to write should only have a col span of 1 (I have to trust that pandoc's native reader would disallow malformed tables), and if it has a row span of 2, at the beginning of next row we'll have:

0 0 1 1 0

where now the 3rd and 4th cells are occupied by the row spans of cells above, and all others are "free".

So I didn't need to know in advance what the grid would look like; keeping track as above was sufficient.

No idea how this would translate to Haskell.
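As a rough illustration only, the bookkeeping described above might look like this in Haskell, without mutation. `Span`, `Occupancy`, `placeRow`, and `nextRow` are hypothetical names, cells are reduced to their spans, and clipping a cell to the free space is an assumption rather than something the Lua filter does.

```haskell
-- Hypothetical, simplified cell: only its row span and column span.
data Span = Span Int Int
  deriving (Eq, Show)

-- | Per column: the number of rows, counting the current one, for
-- which the column is taken by a cell. 0 means the column is free.
type Occupancy = [Int]

-- | Place the cells of one row into the free columns, left to right.
-- Returns the 0-based start column of each placed cell and the
-- occupancy after the row has been filled in.
placeRow :: Occupancy -> [Span] -> ([Int], Occupancy)
placeRow = go 0
  where
    go _ [] _ = ([], [])            -- no columns left: drop remaining cells
    go _ occ [] = ([], occ)         -- no cells left: keep the rest as is
    go col occ@(o : os) cells@(Span rs cs : rest)
      | o > 0 =                     -- column taken by a cell from above
          let (starts, os') = go (col + 1) os cells
          in (starts, o : os')
      | otherwise =
          let free  = length (takeWhile (== 0) occ)
              width = max 1 (min cs free)   -- clip to the free columns
              (starts, os') = go (col + width) (drop width occ) rest
          in (col : starts, replicate width (max 1 rs) ++ os')

-- | Move on to the next row: every occupied column has one row less to go.
nextRow :: Occupancy -> Occupancy
nextRow = map (\n -> max 0 (n - 1))
```

With the occupancy `1 1 2 0 1` from the example and a single cell of row span 2, `placeRow` puts the cell in the 4th column, and `nextRow` then yields `0 0 1 1 0`.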


/Edit: oops, my code has a problem when a particular row has only cells from row spans... so a "true row" is needed as well...
/Edit2: quick fixed, a bit ugly

@tarleb
Collaborator

tarleb commented Aug 27, 2020

Thanks, this is helpful. The algorithm I came up with looks similar, with the difference that I'm trying to avoid mutation. I'm considering switching to your approach; the occupancy idea is nice.

I'd like to reuse your test files, but you'd have to license them as GPL2-or-later. Would you be so kind as to make a PR to add them to the repo?

@kysko

kysko commented Aug 27, 2020

I noticed later that this "occupancy idea" is probably similar to what despresc originally did:

https://github.com/jgm/pandoc-types/blob/dc56b9a9678843649a6b1b50d255cc689fba4412/src/Text/Pandoc/Builder.hs#L648-L659

He calls those "overhangs". Probably a better starting place for you in Haskell.


As for the test files, I've tried to give them a very permissive license (public domain) so anyone can use them as they wish. And since most of them are transformed from others' sources, I'm not sure if I can impose a stricter license.
(e.g. NASA (license?) and later Mozilla (probably CC0/PD) for the planets table, from which I basically regexed into native format; etc, I've given links to most of them)

Maybe a double license "public domain/gpl2" would do, in pandoc-lua-filters?

@tarleb
Collaborator

tarleb commented Aug 29, 2020

Thanks. I used the "planets" table, which is indeed CC0.

@despresc
Contributor

despresc commented Aug 29, 2020

Yes, I found the occupancy/overhang method to be the easiest for dealing with these tables. You could use it to build up a [[(GridPosition, Cell)]] list, but for alignments it's probably easier to use the colspec list as a stack and pop off elements as you move right along the grid row.

I can't remember what we decided for determining cell alignments when the cell had an AlignDefault alignment and it stretched over multiple columns with inconsistent alignments.
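A minimal sketch of the stack idea, under stated assumptions: `SimpleCell` and `resolveAligns` are hypothetical names, each column carries its current overhang next to its alignment, and a cell with `AlignDefault` takes the alignment of the first column it covers. That last rule is only one possible choice for the open question above, not a decision. The sketch also does not clip a cell that would run into an occupied column.

```haskell
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Show)

-- Hypothetical simplified cell: its own alignment and its column span.
data SimpleCell = SimpleCell Alignment Int
  deriving (Eq, Show)

-- | Resolve the alignment of each cell in one row. The first argument
-- pairs every column's overhang (rows still occupied from above, 0 if
-- free) with the column alignment and is consumed like a stack.
resolveAligns :: [(Int, Alignment)] -> [SimpleCell] -> [Alignment]
resolveAligns [] _ = []
resolveAligns _ [] = []
resolveAligns cols@((overhang, _) : rest) cells@(SimpleCell a w : cells')
  | overhang > 0 = resolveAligns rest cells  -- occupied column: pop and skip
  | otherwise =
      let (mine, remaining) = splitAt (max 1 w) cols
          resolved = case (a, mine) of
            -- Assumption: fall back to the first covered column.
            (AlignDefault, (_, colAlign) : _) -> colAlign
            _ -> a
      in resolved : resolveAligns remaining cells'
```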

One thing I should note is that I assumed that no cells would be moved upward or downward while the table was being laid onto the grid. They just get clipped or dropped to fit in the available space. That means that the row number of a cell should be the same as the row number of its parent row (its index in the [Row] list). From what I recall that's what the HTML spec says to try to do, but I could be wrong about that.

@despresc
Contributor

To clarify, I think that that cell placement behaviour on grid rows agrees with the HTML spec on well-formed tables (no overlapping cells, no empty rows, etc.). For invalid tables I think they simply call it a "table model error" and decline to specify what to do.

@kysko

kysko commented Aug 29, 2020

(...) no empty rows, etc. (...)

So how would you render the following (abridged) native format

 Row [Cell AlignDefault (RowSpan 3) (ColSpan 1) [Plain [Str "3x1"]]
     ,Cell AlignDefault (RowSpan 1) (ColSpan 1) [Plain [Str "1x1"]]]
,Row [Cell AlignDefault (RowSpan 2) (ColSpan 1) [Plain [Str "2x1"]]]
,Row [Cell AlignDefault (RowSpan 1) (ColSpan 1) [Plain [Str "1x1"]]
     ,Cell AlignDefault (RowSpan 1) (ColSpan 1) [Plain [Str "1x1"]]]

(see zmultirowspan_native.txt in my link above for full version)

A "direct translation" would be this:

<tr>
<td rowspan="3">3x1</td>
<td>1x1</td>
</tr>
<tr>
<td rowspan="2">2x1</td>
</tr>
<tr>
<td>1x1</td>
<td>1x1</td>
</tr>

but the resulting table as seen in a browser is clearly wrong (Firefox, Vivaldi, IE, Edge).
However, inserting one empty row where the double row spans from above "overhang", gives what is expected after browser rendering (more than one empty row could be needed for larger "overhangs"):

<tr>
<td rowspan="3">3x1</td>
<td>1x1</td>
</tr>
<tr>
<td rowspan="2">2x1</td>
</tr>
<tr></tr> <!-- empty, cells occupied by above row spanned cells -->
<tr>
<td>1x1</td>
<td>1x1</td>
</tr>

and in that case the last row doesn't have the same (HTML) index as the one from the [Row] list.


/Edit, ah, but I do see what you mentioned in the w3.org's Tabular data:

  1. If there exists a row or column in the table containing only slots that do not have a cell anchored to them, then this is a table model error.

@despresc
Contributor

despresc commented Aug 29, 2020

On the face of it the table you gave is this:

+---+---+
|   |   |
+   +---+
|   |   |
+   +   +
|   |   |
+---+---+    -|
|   |   |     | inferred grid dimensions inconsistent with number of rows given
+---+---+    -|

Right now the code lays the table on a grid with height equal to the length of the [Row] list. So the "normalized" version of that table that the functions in pandoc-types would produce, and what that Table is assumed to really represent, is:

+---+---+
|   |   |
+   +---+
|   |   |
+   +   +
|   |   |
+---+---+

They would drop all the cells in the last row, leaving a Row [], since there is no space for the cells. That's assuming that the length of the colspec list is 2, of course.

@despresc
Contributor

despresc commented Aug 30, 2020

I suppose after laying out a row you could check if the overhang in each column is > 1 and insert sufficient empty rows after that row to prevent this dropping, as you do, but some of the other table handling functions would need to be changed to keep the model consistent.
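Roughly, that check could look like the sketch below. `emptyRowsNeeded` is a hypothetical helper, and it assumes that a column's overhang counts the row that was just laid out, so a value greater than 1 means the column is still occupied in the following row.

```haskell
-- | Number of empty rows to insert after the row that was just laid
-- out. The argument holds, per column, the number of rows for which
-- the column is still occupied, counting the row just laid out.
emptyRowsNeeded :: [Int] -> Int
emptyRowsNeeded [] = 0
emptyRowsNeeded overhangs = max 0 (minimum overhangs - 1)
```

In the example above, the overhangs after the second row are 2 and 2, so one empty row would be inserted. If any column becomes free in the next row, the result is 0.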

@kysko

kysko commented Aug 30, 2020

/Edit: oops, published while you were writing your previous comment...

  1. Except that presently, giving pandoc that native code will give back that same table (with the last row intact) when output in native or JSON format. And in a Lua filter, I can also access that last row, so it is read and accepted, not discarded. Or I may misunderstand what you say.

  2. The problem is also independent of pandoc and the new table structure: if I put my two HTML examples above in a validator, the first one gives a warning about a row with four columns, while the second gives an error about an empty row.

Anyway, all this is about testing edge cases, and I only stumbled on that case after posting here.
The lesson here (at least for HTML) is to avoid cases where all cells in a row would be used by cells above that span rows: in such a case, the person (or code) generating the table could just use fewer row spans. In my above example:

<tr>
<td rowspan="2">3x1</td>
<td>1x1</td>
</tr>
<tr>
<td>2x1</td>
</tr>
<tr>
<td>1x1</td>
<td>1x1</td>
</tr>

@kysko

kysko commented Aug 30, 2020

insert sufficient empty rows after that row to prevent this dropping, as you do

I did it because that is how the browser accepted it, so I thought it was how it was supposed to be. Now I see such cases are malformed tables.

So the question is whether to do as browsers do, and accept and create those (formally invalid) empty rows to produce what was visually intended, or reject the table as malformed...

@despresc
Contributor

You are right. The native/json/lua readers and writers take in and emit the tables directly, without other processing. All the other readers and writers do actually transform the tables like I described, or at least they did formerly.

The pandoc Table type is loose, in the sense that there are many ways to represent any particular table. But any native Table is intended to represent one single normalized table. That happens to be the Table itself if it's valid. Otherwise, to get to that intended table there needs to be additional processing, which currently involves clipping or dropping cells, or adding padding cells. Nothing is done about empty rows (inserting or deleting them), since eliminating those kinds of errors entirely is difficult (or it seemed that way to me when I wrote the relevant functions).

@despresc
Contributor

Sorry, that's not quite all of it. The readers and writers do perform those transformations, but a lot of it isn't apparent because they don't yet support row and column spans.

@despresc
Contributor

So for readers, table handling currently looks like

  1. Read in simplified table
  2. Convert simplified table into a real Table
  3. Transform that Table like I described

and for writers that process runs in reverse.

@kysko

kysko commented Aug 30, 2020

So, the native form should be the result of a "normalization",
and by being impatient and using my own native input I circumvented that "normalization" -- at both ends !
or something like that.

@despresc
Contributor

More or less. The non-native readers all try to produce internal tables that are reasonably nice if they aren't already (free from cell overlap errors, at least), and the non-native writers do not assume that they will be given nice internal tables, so they will also try to make sense of them as best they can. They do this consistently by interpreting the internal Table format according to one fixed table model. Since the native readers and writers deal with internal tables directly, that process doesn't happen with them.

@tarleb
Collaborator

tarleb commented Sep 12, 2020

Re-adding this comment after I had first misplaced it in the issue for the HTML reader:

Colspans and rowspans have been added in #6644. Table features which have not been added yet:

  • intermediate headers
  • footers
  • attributes on all elements for which the information is available
