HTML writer - support new table features #6314

jgm · 2020-04-24T01:25:01Z

Add support for new table features introduced in
jgm/pandoc-types#66
including table attributes (including identifier), rowspan, colspan, table head and foot, multiple header lines, row headers, captions that allow block-level content and include an optional short caption.

tarleb · 2020-08-24T14:00:39Z

I'm starting to work on this.

danlobo02 · 2020-08-24T15:31:09Z

subscribing

danlobo02 · 2020-08-24T15:35:15Z

Is this related or is there a ticket for html reader to support jgm/pandoc-types#66 ?

tarleb · 2020-08-24T15:42:45Z

The ticket for the HTML reader is #6312. It has a "Good first issue" label, but I'm not sure if that really applies.

tarleb · 2020-08-25T15:10:04Z

Progress notes: The hardest problem I'm facing is that, in order to apply the correct alignment to each cell and the correct row number to each row, we need to have a good idea of how the table grid will look. This has become much more difficult with the new table structure. My current line of attack is building a separate T.P.Writers.Tables module which allows creating a rectangular GridTable. This should allow reusing large parts of the current table writer functions, while making it easy to find a cell's column and row number.

-- | Table row offset (i.e., the number of rows which have to be
-- moved up to find the topmost row belonging to a cell).
newtype RowOffset = RowOffset Int
  deriving (Eq, Num, Enum)

-- | Table column offset (i.e., the number of columns which have to
-- be moved left to find the leftmost column belonging to a cell).
newtype ColOffset = ColOffset Int
  deriving (Eq, Num, Enum)

-- | Rectangular table which makes it easy to match table cells
-- with their column and row numbers.
newtype GridTable = GridTable [GridRow]

-- | Single row of a 'GridTable'.
newtype GridRow = GridRow [GridCell]

-- | Single cell of a 'GridTable'. Usually, only cells with zero
-- offsets should be rendered. Other cells serve as placeholders.
data GridCell = GridCell RowOffset ColOffset Cell

toGridTable :: [Row] -> GridTable
toGridTable = undefined -- WIP

This should be useful for other writers as well.

I'd love to learn about alternative approaches and ideas.
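
To illustrate how the offsets in the proposed GridCell are meant to work, here is a minimal sketch. It is not the WIP toGridTable: the types are simplified stand-ins (GridCell is parameterized over the cell content instead of using pandoc's Cell), and expandCell and isAnchor are hypothetical helper names. A cell with a row span and column span is expanded into the block of grid cells it covers, and only the cell with zero offsets is treated as renderable.

```haskell
-- Simplified stand-ins for the types above (assumption: the cell
-- payload is a type parameter rather than pandoc's 'Cell').
newtype RowOffset = RowOffset Int deriving (Eq, Show)
newtype ColOffset = ColOffset Int deriving (Eq, Show)

data GridCell a = GridCell RowOffset ColOffset a
  deriving (Eq, Show)

-- | Expand a cell with the given row span and column span into the
-- block of grid cells it covers, one inner list per grid row.
-- (Hypothetical helper, not part of pandoc.)
expandCell :: Int -> Int -> a -> [[GridCell a]]
expandCell rowSpan colSpan cell =
  [ [ GridCell (RowOffset r) (ColOffset c) cell
    | c <- [0 .. colSpan - 1] ]
  | r <- [0 .. rowSpan - 1] ]

-- | A grid cell is an anchor (and should be rendered) only when
-- both of its offsets are zero; all others are placeholders.
isAnchor :: GridCell a -> Bool
isAnchor (GridCell (RowOffset 0) (ColOffset 0) _) = True
isAnchor _ = False
```

For a cell spanning 2 rows and 3 columns, `concat (expandCell 2 3 "x")` has six entries, exactly one of which satisfies `isAnchor`.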

jgm · 2020-08-25T17:27:15Z

It would be great to get feedback from @despresc on this.

kysko · 2020-08-26T15:44:37Z

I wrote an experimental HTML table writer as a Lua filter in the past few weeks, just for the heck of it, knowing it would be useless once there was an official writer for that; so I didn't publish it, but maybe it could give some ideas now. (Sorry if it's a bit messy.)
(I included some native-format files that I used for testing.)

I ran into a similar problem in determining the ~~"true cell"~~ "true column" of a cell. It's not in Haskell, but maybe it could help.

In particular, it's in the "Table Environment" function/pseudoclass col_env, more exactly the .tcols part, which is used in the function process_cell, in its "Columns State" section.

Basically, I keep track of which cells are occupied across rows if there are row spans: as I read each cell's info, I write its row span across its col span, and as I advance through the rows, the counters are decremented at each row start to indicate how much of the current occupancy remains; an occupancy of 0 means it is safe to write to that cell.

For example, at the beginning of processing a row (for a five-cell row) we could have:

1 1 2 0 1

which means that the first 3 cells and the fifth are still "occupied" by row-spanning cells from above, and only the 4th cell can be written to.
The cell to write should only have a col span of 1 (I have to trust that pandoc's native reader would disallow malformed tables), and if it has a row span of 2, at the beginning of the next row we'll have:

0 0 1 1 0

where now the 3rd and 4th cells are occupied by the row spans of cells above, and all others are "free".

So I didn't need to know in advance what the grid would look like; keeping track as above was sufficient.

No idea how this would translate to Haskell.

/Edit: oops, my code has a problem when a particular row has only cells from row spans... so there's need for a "true row" also...
/Edit2: quick fixed, a bit ugly
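
For reference, the occupancy bookkeeping described above could be sketched in Haskell without mutation roughly as follows. This is only a sketch under assumptions, not the Lua filter's code or pandoc's implementation: SimpleCell is a hypothetical, simplified cell type, placeRow and placeRows are made-up names, cells that do not fit are clipped or dropped, and no empty rows are inserted.

```haskell
import Data.List (mapAccumL)

-- | Hypothetical, simplified cell: row span, column span, and a label.
-- (pandoc's real 'Cell' also carries attributes, alignment and blocks.)
data SimpleCell = SimpleCell
  { cellRowSpan :: Int
  , cellColSpan :: Int
  , cellLabel   :: String
  } deriving (Eq, Show)

-- | One counter per grid column: the number of rows, counting the
-- current one, during which the column is taken by a cell from above.
type Occupancy = [Int]

-- | Place the cells of one row. Returns the occupancy at the start of
-- the next row and each placed cell with its zero-based column index.
placeRow :: Occupancy -> [SimpleCell] -> (Occupancy, [(Int, SimpleCell)])
placeRow occ cells = (map (\n -> max 0 (n - 1)) occ', placed)
  where
    (occ', placed) = go 0 occ cells
    go _ [] _ = ([], [])    -- ran out of columns: drop remaining cells
    go _ os [] = (os, [])   -- no cells left: keep remaining counters
    go col (o : os) (c : cs)
      | o > 0 =             -- column taken from above: skip it
          let (os', ps) = go (col + 1) os (c : cs) in (o : os', ps)
      | otherwise =
          let -- clip the column span to the run of free columns
              free = length (takeWhile (== 0) (o : os))
              w = max 1 (min (cellColSpan c) free)
              (rest', ps) = go (col + w) (drop w (o : os)) cs
          in (replicate w (max 1 (cellRowSpan c)) ++ rest', (col, c) : ps)

-- | Place all rows of a table body with the given number of columns.
placeRows :: Int -> [[SimpleCell]] -> [[(Int, SimpleCell)]]
placeRows ncols = snd . mapAccumL placeRow (replicate ncols 0)
```

With the five-column example above, `placeRow [1,1,2,0,1] [SimpleCell 2 1 "new"]` puts the cell in column index 3 and returns `[0,0,1,1,0]` as the occupancy for the next row.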

tarleb · 2020-08-27T09:44:04Z

Thanks, this is helpful. The algorithm I came up with looks similar, with the difference that I'm trying to avoid mutation. I'm considering switching to your approach; the occupancy idea is nice.

I'd like to reuse your test files, but you'd have to license them as GPL2-or-later. Would you be so kind as to make a PR to add them to the repo?

kysko · 2020-08-27T15:35:07Z

I later saw that this "occupancy idea" is probably similar to what despresc originally did:

https://github.com/jgm/pandoc-types/blob/dc56b9a9678843649a6b1b50d255cc689fba4412/src/Text/Pandoc/Builder.hs#L648-L659

he calls those "overhangs". Probably a better starting place for you in Haskell.

As for the test files, I've tried to give a very permissive license (public domain) so one could use them as they will. And since most of them are transformed from others' source, I'm not sure if I can impose a stricter license.
(e.g. NASA (license?) and later Mozilla (probably CC0/PD) for the planets table, from which I basically regexed into native format; etc, I've given links to most of them)

Maybe a double license "public domain/gpl2" would do, in pandoc-lua-filters?

tarleb · 2020-08-29T14:29:10Z

Thanks. I used the "planets" table, which is indeed CC0.

despresc · 2020-08-29T21:57:20Z

Yes, I found the occupancy/overhang method to be the easiest for dealing with these tables. You could use it to build up a [[(GridPosition, Cell)]] list, but for alignments it's probably easier to use the colspec list as a stack and pop off elements as you move right along the grid row.

I can't remember what we decided for determining cell alignments when the cell had an AlignDefault alignment and it stretched over multiple columns with inconsistent alignments.

One thing I should note is that I assumed that no cells would be moved upward or downward while the table was being laid onto the grid. They just get clipped or dropped to fit in the available space. That means that the row number of a cell should be the same as the row number of its parent row (its index in the [Row] list). From what I recall that's what the HTML spec says to try to do, but I could be wrong about that.
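
As a rough illustration of using the column alignments as a stack, consider the sketch below. The Alignment constructors mirror those in pandoc-types; everything else is an assumption made for the example: Slot and resolveAligns are hypothetical names, and the rule that an AlignDefault cell takes the alignment of the first column it covers is only one possible choice, not a decision recorded in this thread.

```haskell
-- | Alignment values, mirroring the constructors in pandoc-types.
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Show)

-- | Hypothetical grid-row entry: either a run of columns covered by a
-- cell from a previous row, or a cell anchored in this row with its
-- own alignment and column span.
data Slot
  = Covered Int
  | Anchored Alignment Int
  deriving (Eq, Show)

-- | Resolve the alignment of each anchored cell by consuming the
-- column alignments left to right, like popping from a stack.
resolveAligns :: [Alignment] -> [Slot] -> [Alignment]
resolveAligns _ [] = []
resolveAligns colAligns (Covered n : slots) =
  resolveAligns (drop n colAligns) slots
resolveAligns colAligns (Anchored a n : slots) =
  pick a (take n colAligns) : resolveAligns (drop n colAligns) slots
  where
    -- Assumption: a default-aligned cell inherits the alignment of
    -- the first column it covers; an explicit alignment always wins.
    pick AlignDefault (c : _) = c
    pick own _ = own
```

For column alignments `[AlignLeft, AlignRight, AlignCenter]` and a row `[Covered 1, Anchored AlignDefault 2]`, the single anchored cell resolves to `AlignRight` under this rule.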

despresc · 2020-08-29T22:07:16Z

To clarify, I think that that cell placement behaviour on grid rows agrees with the HTML spec on well-formed tables (no overlapping cells, no empty rows, etc.). For invalid tables I think they simply call it a "table model error" and decline to specify what to do.

kysko · 2020-08-29T22:59:36Z

> (...) no empty rows, etc. (...)

So how would you render the following (abridged) native format

 Row [Cell AlignDefault (RowSpan 3) (ColSpan 1) [Plain [Str "3x1"]]
     ,Cell AlignDefault (RowSpan 1) (ColSpan 1) [Plain [Str "1x1"]]]
,Row [Cell AlignDefault (RowSpan 2) (ColSpan 1) [Plain [Str "2x1"]]]
,Row [Cell AlignDefault (RowSpan 1) (ColSpan 1) [Plain [Str "1x1"]]
     ,Cell AlignDefault (RowSpan 1) (ColSpan 1) [Plain [Str "1x1"]]]

(see zmultirowspan_native.txt in my link above for full version)

A "direct translation" would be this:

<tr>
<td rowspan="3">3x1</td>
<td>1x1</td>
</tr>
<tr>
<td rowspan="2">2x1</td>
</tr>
<tr>
<td>1x1</td>
<td>1x1</td>
</tr>

but the resulting table as seen in a browser is clearly wrong (Firefox, Vivaldi, IE, Edge).
However, inserting one empty row where the double row spans from above "overhang", gives what is expected after browser rendering (more than one empty row could be needed for larger "overhangs"):

<tr>
<td rowspan="3">3x1</td>
<td>1x1</td>
</tr>
<tr>
<td rowspan="2">2x1</td>
</tr>
<tr></tr> <!-- empty, cells occupied by above row spanned cells -->
<tr>
<td>1x1</td>
<td>1x1</td>
</tr>

and in that case the last row doesn't have the same (HTML) index as the one from the [Row] list.

/Edit, ah, but I do see what you mentioned in the w3.org's Tabular data:

If there exists a row or column in the table containing only slots that do not have a cell anchored to them, then this is a table model error.

despresc · 2020-08-29T23:52:34Z

On the face of it the table you gave is this:

+---+---+
|   |   |
+   +---+
|   |   |
+   +   +
|   |   |
+---+---+    -|
|   |   |     | inferred grid dimensions inconsistent with number of rows given
+---+---+    -|

Right now the code lays the table on a grid with height equal to the length of the [Row] list. So the "normalized" version of that table that the functions in pandoc-types would produce, and what that Table is assumed to really represent, is:

+---+---+
|   |   |
+   +---+
|   |   |
+   +   +
|   |   |
+---+---+

They would drop all the cells in the last row, leaving a Row [], since there is no space for the cells. That's assuming that the length of the colspec list is 2, of course.

despresc · 2020-08-30T00:15:16Z

I suppose after laying out a row you could check if the overhang in each column is > 1 and insert sufficient empty rows after that row to prevent this dropping, as you do, but some of the other table handling functions would need to be changed to keep the model consistent.
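
A small sketch of that check, under a stated convention: the counters are taken at the start of the next row (after the per-row decrement), so each one says for how many more rows that column is covered from above. If every column is still covered, the smallest counter is the number of empty rows that would have to be inserted. The name paddingRows is hypothetical, and this is not what pandoc-types currently does.

```haskell
-- | Number of empty rows to insert before the next row, given one
-- counter per column saying for how many upcoming rows the column is
-- still covered by a cell from above. If any column is free (counter
-- 0), the next row can hold a cell and no padding is needed.
paddingRows :: [Int] -> Int
paddingRows [] = 0
paddingRows occ = max 0 (minimum occ)
```

For the two-column example discussed here, the counters before the third row are `[1, 1]`, so `paddingRows [1, 1]` gives 1, matching the single empty row inserted in the HTML above.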

kysko · 2020-08-30T00:16:01Z

/Edit: oops, published while you were writing your previous comment...

Except that presently, giving pandoc that native code will give back that same table (with the last row intact) when output in native or json format. And in a Lua filter, I can also access that last row, so it is read and accepted, not discarded. Or I may be misunderstanding what you say.
The problem is also independent of pandoc and the new table structure: if I put my two HTML examples above in a validator, the first one gives a warning about a row with four columns, while the second gives an error about an empty row.

Anyway, all this is about testing edge cases, and I only stumbled on that case after posting here.
The lesson here (at least for HTML) is to avoid cases where all cells in a row would be taken up by cells spanning rows from above: in such a case, the person (or code) generating the table could just use smaller row spans. In my above example:

<tr>
<td rowspan="2">3x1</td>
<td>1x1</td>
</tr>
<tr>
<td>2x1</td>
</tr>
<tr>
<td>1x1</td>
<td>1x1</td>
</tr>

kysko · 2020-08-30T00:24:37Z

> insert sufficient empty rows after that row to prevent this dropping, as you do

I did it because that is how the browser accepted it, so I thought it was how it was supposed to be. Now I see such cases are malformed tables.

So the question is whether to do as browsers do, and accept and create those (formally invalid) empty rows to produce what was visually intended, or reject the table as malformed...

despresc · 2020-08-30T00:41:53Z

You are right. The native/json/lua readers and writers take in and emit the tables directly, without other processing. All the other readers and writers do actually transform the tables like I described, or at least they did formerly.

The pandoc Table type is loose, in the sense that there are many ways to represent any particular table. But any native Table is intended to represent one single normalized table. That happens to be the Table itself if it's valid. Otherwise, to get to that intended table there needs to be additional processing, which currently involves clipping or dropping cells, or adding padding cells. Nothing is done about empty rows (inserting or deleting them), since eliminating those kinds of errors entirely is difficult (or it seemed that way to me when I wrote the relevant functions).

despresc · 2020-08-30T00:54:12Z

Sorry, that's not quite all of it. The readers and writers do perform those transformations, but a lot of it isn't apparent because they don't yet support row and column spans.

despresc · 2020-08-30T00:57:17Z

So for readers, table handling currently looks like

1. Read in simplified table
2. Convert simplified table into a real Table
3. Transform that Table like I described

and for writers that process runs in reverse.

kysko · 2020-08-30T01:15:49Z

So, the native form should be the result of a "normalization",
and by being impatient and using my own native input I circumvented that "normalization" -- at both ends !
or something like that.

despresc · 2020-08-30T02:43:54Z

More or less. The non-native readers all try to produce internal tables that are reasonably nice if they aren't already (free from cell overlap errors, at least), and the non-native writers do not assume that they will be given nice internal tables, so they will also try to make sense of them as best they can. They do this consistently by interpreting the internal Table format according to one fixed table model. Since the native readers and writers deal with internal tables directly, that process doesn't happen with them.

tarleb · 2020-09-12T18:52:44Z

Re-adding this comment after I had first misplaced it in the issue for the HTML reader:

Colspans and rowspans have been added in #6644. Table features which have not been added yet:

- intermediate headers
- footers
- attributes on all elements for which the information is available

Part of: jgm#6314

jgm changed the title ~~LaTeX writer - support new table features~~ HTML writer - support new table features Apr 24, 2020

jgm added format:HTML writer labels Apr 24, 2020

jgm mentioned this issue Apr 24, 2020

List of projects #5581

Closed

9 tasks

reagle mentioned this issue May 15, 2020

Support for table column spans, table attributes in AST #1024

Closed

mb21 added the good first issue label Jun 10, 2020

tarleb removed the good first issue label Aug 24, 2020

lrosenthol mentioned this issue Aug 24, 2020

ICML Writer - Support new table features #6615

Open

tarleb mentioned this issue Aug 28, 2020

Support colspans and rowspans in HTML tables #6644

Merged

tarleb added a commit to tarleb/pandoc that referenced this issue Sep 12, 2020

HTML writer: render table footers if present

4fefd73

Part of: jgm#6314

tarleb added a commit to tarleb/pandoc that referenced this issue Sep 12, 2020

HTML writer: render table footers if present

a400d0d

Part of: jgm#6314

tarleb closed this as completed in 34151e8 Sep 13, 2020

jgm mentioned this issue Sep 27, 2020

Document which table features are supported in which formats #6701

Open

tarleb mentioned this issue Dec 1, 2020

tests/sample_files/native/*.native: normalized sergiocorreia/panflute#172

Merged
