Improving tables #65

despresc · 2020-03-21T19:17:36Z

See the main todo list and the relevant issue. I would like to start implementing better table handling in Pandoc. Specifically, I would implement all but the last of these bullet points using one of the designs below (or a modified version of one of them).

I think something like this recently outlined approach is a good way forward for now. The representation is a little loose (any table in the intermediate representation is valid, so there are multiple ways to write a given table, but only one normalized way), but it should allow the readers and writers to be switched more easily. This is a slightly modified version of that approach:

type RowSpan = Int
type ColSpan = Int
type Caption = [Block]
type ShortCaption = [Inline]
type ColWidth = Maybe Double
data CellType = DataCell | HeaderCell
data Cell = Cell Attr CellType (Maybe Alignment) RowSpan ColSpan [Block]
type Row = [Cell]

data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] [Row]
  ...

The Maybe Alignment on the individual cells allows the cells to override the alignment of the column(s) in which they reside. This makes it easier to specify one's intentions when a cell spans multiple columns with conflicting alignments, and has the advantage of allowing better \multicolumn and \multirow support in the LaTeX reader and writer. It also comes up naturally when one thinks of possible extensions to the supported markdown table formats.
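
As a minimal sketch of the override rule described above: a writer could resolve the alignment of a cell by preferring the cell's own value and falling back to the column's. The helper name `effectiveAlignment` is hypothetical, and `Alignment` is redefined here only so the snippet stands alone (it mirrors the constructors in pandoc-types).

```haskell
import Data.Maybe (fromMaybe)

-- Stand-in for the Alignment type from pandoc-types, repeated here so
-- the sketch is self-contained.
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Show)

-- | Hypothetical helper (not part of the proposal): the cell's own
-- alignment, when present, overrides the alignment of its column.
effectiveAlignment :: Alignment -> Maybe Alignment -> Alignment
effectiveAlignment columnAlign cellAlign = fromMaybe columnAlign cellAlign
```

For a cell spanning several columns, a writer would still have to pick which column's alignment serves as the fallback (the leftmost, say); that choice is not settled by the types.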

A similar design has the following modifications:

data Cell = Cell Attr (Maybe Alignment) RowSpan ColSpan [Block]
data HeaderRow = Row Attr [Cell]
data BodyRow = Row Attr [Cell] [Cell]

data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] [HeaderRow] [BodyRow] [HeaderRow]
  ...

This has the advantage of making explicit the table head/body/foot and row head/body structure that seems to be assumed in the first approach, where the first entirely header rows become the table head, and the last such rows become the table foot. Cells in the head and foot sections would correspond to th cells, and cells in the body section would correspond to td cells. It does not require a CellType, but one could still be added, making these even more similar to HTML tables. This approach has the disadvantage of making the table representation more complex.

I assume that the tables are normalized (laid on a grid with a given width so that overlapping cells and empty spaces can be dealt with in the table) like so, informally:

Empty rows are filtered out from the table.
The grid has a height equal to the number of rows in the table, and some fixed width.
Rows are laid on the grid from top to bottom.
The top of each cell is as far down on the grid as it is on the table.
The top-left corner of each cell, in turn, is placed on the leftmost empty grid space on the row, if it exists within the grid width, and is otherwise dropped. If it would overlap a cell on a previous row or extend past the remaining grid width, its width (ColSpan) would be lowered to fit. If it would extend past the bottom of the grid, its height (RowSpan) would be lowered to fit.
If there are too few cells in a row to fill the available width, then blank cells are added to the end of the row.
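
The steps above could be sketched roughly as follows. This is one possible reading of the informal rules, not the eventual implementation: `Cell` is cut down to spans plus a `String` body, and `placeRow`, `normalize` and `blankCell` are hypothetical names. The grid state is a list with one entry per column, counting how many upcoming rows are still covered by a cell that started on an earlier row.

```haskell
type RowSpan = Int
type ColSpan = Int

-- Simplified stand-in: the proposed Cell also carries Attr, an
-- optional alignment and [Block] content.
data Cell = Cell RowSpan ColSpan String
  deriving (Eq, Show)

blankCell :: Cell
blankCell = Cell 1 1 ""

-- | Lay one row on the grid. 'rowsLeft' counts the rows from this one
-- to the bottom of the grid. Each entry of the occupancy list says how
-- many rows, starting with this one, a column is still covered by a
-- cell from an earlier row. Returns the adjusted cells and the
-- occupancy list for the next row.
placeRow :: Int -> [Int] -> [Cell] -> ([Cell], [Int])
placeRow rowsLeft = go
  where
    go [] _ = ([], [])            -- grid width exhausted: drop the rest
    go occ@(o : os) cells
      | o > 0 =                   -- column covered from above: skip it
          let (placed, occ') = go os cells
           in (placed, (o - 1) : occ')
      | otherwise =
          let free = length (takeWhile (== 0) occ)
              (Cell rs cs body, rest) = case cells of
                [] -> (blankCell, [])       -- pad short rows
                (c : more) -> (c, more)
              w = max 1 (min cs free)       -- shrink ColSpan to fit
              h = max 1 (min rs rowsLeft)   -- shrink RowSpan to fit
              (placed, occ') = go (drop w occ) rest
           in (Cell h w body : placed, replicate w (h - 1) ++ occ')

-- | Normalize a list of rows against a grid of the given width.
normalize :: Int -> [[Cell]] -> [[Cell]]
normalize width rows = go (length rows') (replicate width 0) rows'
  where
    rows' = filter (not . null) rows        -- empty rows are removed
    go _ _ [] = []
    go n occ (r : rs) =
      let (r', occ') = placeRow n occ r
       in r' : go (n - 1) occ' rs
```

For example, with a width of 3, a first row holding a 2x2 cell followed by a cell with `ColSpan` 5 has the second cell narrowed to a single column, and a cell on the next row lands in the third column because the first two are still covered.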

The table head, table foot, row head (the list of row head sections without the row body), and row body (the list of row body sections without the row head) should be normalized independently in any design where these exist (implicitly in the first, or explicitly in the second). The overall table width would be the length of the [(Alignment, ColWidth)] list, and the row head and row body widths would add up to that width. (The row head width would be the width of the first row in the row head.)
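
The width bookkeeping in the previous paragraph might look like the sketch below. It assumes that the width of the first row head is the sum of its cells' column spans, which is only my reading of "the width of the first row in the row head"; the types are simplified and the function names are hypothetical.

```haskell
type ColSpan = Int

-- Simplified stand-ins: a cell is a column span plus a String body,
-- and a row is a row head followed by a row body.
data Cell = Cell ColSpan String
data Row = Row [Cell] [Cell]

-- | Overall table width: one grid column per column spec.
tableWidth :: [spec] -> Int
tableWidth = length

-- | Row head width, taken from the first row (assumption: the sum of
-- the column spans of that row's head cells).
rowHeadWidth :: [Row] -> Int
rowHeadWidth [] = 0
rowHeadWidth (Row hd _ : _) = sum [cs | Cell cs _ <- hd]

-- | Row body width: whatever remains of the table width.
rowBodyWidth :: [spec] -> [Row] -> Int
rowBodyWidth specs rows = max 0 (tableWidth specs - rowHeadWidth rows)
```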

jgm · 2020-03-22T02:04:23Z

I'm delighted that you're interested in taking this on. It's one of the top priority improvements for pandoc, but it has been hard to get it done because (a) it's a big change and (b) it's hard to decide what the best type is.

I don't think we should let the perfect be the enemy of the good: we should discuss (b), but we should set a limit to how long we discuss it before just moving ahead with something that will be better than what we have currently. (If needed, we can make further incremental changes in the future.)

More later...

jgm · 2020-03-22T04:06:16Z

The first approach allows any cell to be a header cell. That might be an advantage for representing tables where the left column is the header (not common) -- such tables can't be represented in the second approach -- but it has the disadvantage that many table formats can't represent arbitrary header cells. (HTML is an exception obviously.) So I'm leaning more to the second approach. I don't know how important it is to represent tables where the header is a column rather than a row, and I'm not sure what the cost would be of unrepresentable tables on the first approach.

despresc · 2020-03-22T14:39:03Z

The layout I have in my mind, incidentally, is this:

+---------------------+
|     Table Head      |
+----------+----------+
| Row Head | Row Body |
+----------+----------+
|     Table Foot      |
+---------------------+

since I realized that the second representation might suggest that the row headers are not under the table head. This is also the implicit layout of the first approach.

When you say that "the left column is the header", do you mean that the table is transposed during writing so that the table head rows become columns? Otherwise I think that the row head section could be used as the header. The only oddity would be that a single header line would be split up among multiple rows.

In the first approach, I suppose that after separating out as many sections as the writer supports (table head, foot, row head) the writer would forget about the cell type and simply write the cell content as-is.
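
A rough sketch of that separation step for the first approach, assuming the rule stated earlier (the leading all-header rows form the table head, the trailing ones the table foot). The types are simplified and `splitSections` is a hypothetical name; the row head is not separated here.

```haskell
data CellType = DataCell | HeaderCell
  deriving (Eq, Show)

-- Simplified stand-in: only the cell type and a String body.
data Cell = Cell CellType String
  deriving (Eq, Show)

type Row = [Cell]

-- | A row counts as a header row when every cell is a header cell.
-- (An empty row also passes, but normalization removes those first.)
isHeaderRow :: Row -> Bool
isHeaderRow = all (\(Cell t _) -> t == HeaderCell)

-- | Split rows into (table head, table body, table foot).
splitSections :: [Row] -> ([Row], [Row], [Row])
splitSections rows = (hd, body, ft)
  where
    (hd, rest) = span isHeaderRow rows
    (ftRev, bodyRev) = span isHeaderRow (reverse rest)
    body = reverse bodyRev
    ft = reverse ftRev
```

If every row is a header row, this puts them all in the head and leaves the body and foot empty; a writer might prefer a different tie-break.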

jgm · 2020-03-22T19:47:26Z

I wasn't understanding what you mean by Row Head. Now I see you mean a header cell in the left position in a row. And now I notice that you have Row Attr [Cell] [Cell] -- the first group of cells is the row header, the rest the body. OK, that makes sense. More in a bit.

jgm · 2020-03-22T20:33:30Z

I'm wondering whether it would make things easier if the types were a bit more uniform. Rendering a header row will often be almost the same as rendering a body row. What if we just had

data Row = HeaderRow Attr [Cell] {- row heads -} [Cell] {- other cells -}
data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] [Row] {- header -} [Row] {- body -} [Row] {- footer -}

The drawback is that this allows you to represent distinctions that are irrelevant in the header and footer rows. The advantage is that it makes it easier to deal with rows in a uniform way in the code. I'm not really sure about this tradeoff.

If we do go with your original approach, we'll need a different type constructor:

data HeaderRow = HeaderRow Attr [Cell]

Another approach might be:

data Row a = Row a Attr [Cell]
data HeaderRow
data FooterRow
data BodyRow = BodyRow [Cell]
...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)]
         [Row HeaderRow] [Row BodyRow] [Row FooterRow]

despresc · 2020-03-23T14:57:32Z

Writers that can't represent row headers might find it easier to concatenate the row head and body and operate on an [(Attr, [Cell])], or even a [[Cell]], list.
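
For instance, a writer could flatten rows along these lines. This is a sketch only: `Attr` is written with `String` rather than pandoc's `Text`, `Cell` is reduced to a `String` body, and `flattenRow` and `rowCells` are hypothetical helpers.

```haskell
-- Stand-in with the same shape as pandoc's Attr (identifier, classes,
-- key-value pairs), using String to avoid extra dependencies.
type Attr = (String, [String], [(String, String)])

-- Simplified stand-in for a table cell.
newtype Cell = Cell String
  deriving (Eq, Show)

-- Row attributes, row head cells, row body cells.
data Row = Row Attr [Cell] [Cell]

-- | Forget the row head/body split but keep the attributes.
flattenRow :: Row -> (Attr, [Cell])
flattenRow (Row attr rowHead rowBody) = (attr, rowHead ++ rowBody)

-- | Forget the attributes as well.
rowCells :: Row -> [Cell]
rowCells = snd . flattenRow
```

Mapping `flattenRow` or `rowCells` over a list of rows gives the `[(Attr, [Cell])]` and `[[Cell]]` views respectively.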

If there were a uniform row type, then the table picture could be

+-------------------+-------------------+
| TH above row head | TH above row body |
+-------------------+-------------------+
|     Row Head      |     Row Body      |
+-------------------+-------------------+
| TF below row head | TF below row body |
+-------------------+-------------------+

I am not sure if this is a useful distinction, but it does give the row header in the table head some meaning.

jgm · 2020-03-23T16:54:21Z

If I'm understanding you correctly, you are now suggesting a uniform type

data Row = Row Attr [Cell] [Cell]

to be used for the header, body, and footer? That sounds good to me. It's conceivable that some formats could treat "TH above row head" specially.

despresc · 2020-03-23T17:02:10Z

That was my interpretation of the first Row type in your previous comment. I think it's reasonable; at worst it's another distinction in the intermediate representation for writers to ignore to a greater or lesser degree, and the uniformity could be of some benefit, though it's hard to say without actually starting to implement the change.

jgm · 2020-03-23T19:06:37Z

OK, to summarize then:

type RowSpan = Int
type ColSpan = Int
type Caption = [Block]
type ShortCaption = [Inline]
type ColWidth = Maybe Double
data Cell = Cell Attr (Maybe Alignment) RowSpan ColSpan [Block]
type RowHead = [Cell]
type RowBody = [Cell]
data Row = Row Attr RowHead RowBody
type TableHead = [Row]
type TableBody = [Row]
type TableFoot = [Row]
data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] TableHead TableBody TableFoot

@tarleb - what do you think of this?

tarleb · 2020-03-24T12:01:22Z

Looks good to me!

Maybe we could group the arguments to Table more, e.g. by introducing a type data BareTable = BareTable TableHead TableBody TableFoot or something similar?

jgm · 2020-03-24T16:08:40Z

If we want to compress things, I'd prefer something like

data Caption = Caption (Maybe [Inline]) [Block] -- short caption, full caption
type ColSpec = (Alignment, Maybe Double)
data Block =
  ...
  | Table Attr Caption [ColSpec] TableHead TableBody TableFoot

And should we consider using newtypes instead of type for things like TableBody and ColSpec?

despresc · 2020-03-24T21:57:56Z

Having the caption components bundled together would be good. That bundling might happen anyway with a new Figure block.

I'm not sure how great the benefit of newtyping would be. The ColSpec could be data ColSpec = ColSpec ... I suppose.

jgm · 2020-03-25T00:41:31Z

Advantage of a newtype is that the types then enforce the distinctions. With type synonyms you won't get an error if you put a TableHead where a TableBody should go, etc.
Disadvantage is that it's a bit more cumbersome doing pattern matching, etc. However, it's not too hard, and there's always coerce.
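
To illustrate the tradeoff, here is a small sketch with `Row` reduced to a list of strings. The newtype wrappers and the function names are hypothetical; `coerce` is the real function from `Data.Coerce` in base.

```haskell
import Data.Coerce (coerce)

-- Simplified stand-in for a table row.
type Row = [String]

newtype TableHead = TableHead [Row]
newtype TableBody = TableBody [Row]

-- | Pattern matching needs the extra constructor...
bodyRowCount :: TableBody -> Int
bodyRowCount (TableBody rows) = length rows

-- | ...but coerce unwraps without any runtime cost.
bodyRows :: TableBody -> [Row]
bodyRows = coerce

-- | A deliberate conversion has to be spelled out. Passing a TableHead
-- directly to bodyRowCount would be rejected by the type checker,
-- whereas with plain type synonyms it would compile silently.
headAsBody :: TableHead -> TableBody
headAsBody = coerce
```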

jgm · 2020-03-25T00:56:07Z

That said, we use type aliases all over the place in pandoc-types now (e.g. Attr), so maybe this isn't the time to change.

despresc · 2020-03-25T02:32:25Z

Perhaps it isn't the time to change type/newtype approaches.

It sounds like the most recent summary, with the modified data Caption, is acceptable. If that is the case, I can start working from that design.

jgm · 2020-03-25T04:06:51Z

Sounds good to me!

This implements issue jgm#65 for the library itself. The tests do not compile. The Legacy modules are hidden until a way of dealing with them has been decided.

Mercerenies · 2020-03-26T01:24:58Z

Hello,

I just wanted to share that I'm in the process of submitting a Google Summer of Code 2020 proposal to provide a library with similar functionality, as it seems to be something many Haskell packages could benefit from, not least of all pandoc. The exact API is not finalized, but the proposal is in rough draft form at the moment. I do hope this is something that can benefit this project and many others.

jgm · 2020-03-26T03:54:45Z

@Mercerenies - the proposal looks quite interesting. How were you thinking it intersects with pandoc? Do you have any suggestions about the proposal above, or does it seem reasonable to you?

Mercerenies · 2020-03-26T13:09:59Z

The timing ended up being quite inconvenient, as I reached out to @tarleb about the proposal a few days before this issue was opened. That being said, I do still feel like a dedicated library for this kind of thing would be very nice to have, for several reasons, even if pandoc has its own type as well.

In terms of the above proposal, I share the concern about type vs newtype but understand why making that change would be an inconsistency with the rest of the library. Aside from that, what you said above seems pretty reasonable. I'd personally go for the version that doesn't involve a 7-arg constructor.

jgm · 2020-03-26T16:09:22Z

I completely agree that a dedicated library could be useful even if pandoc has its own type -- and there could be glue code converting between pandoc tables and this library's type.

despresc mentioned this issue Mar 26, 2020

Better tables #66

Merged

jgm closed this as completed in f76c1b7 Apr 17, 2020

ickc mentioned this issue Jul 2, 2020

Supporting pandoc 2.11 sergiocorreia/panflute#142

Closed

jgm mentioned this issue Sep 18, 2020

Add colspan/rowspan support to Table #29

Closed

ickc mentioned this issue Nov 10, 2020

Supporting pandoc 2.11 ickc/pantable#51

Closed

ickc mentioned this issue Nov 28, 2020

Supporting pandoc-types 1.22 elliottslaughter/rust-pandoc-types#2

Closed

istathar mentioned this issue May 7, 2021

Change formatter to use pipe tables aesiniath/publish#57

Merged

jgm mentioned this issue Feb 19, 2025

docx table with w:firstColumn unmerges cells jgm/pandoc#10627

Closed
