.. currentmodule:: pyarrow.json

Reading JSON files

Arrow supports reading columnar data from line-delimited JSON files. In this context, a JSON file consists of multiple JSON objects, one per line, representing individual data rows. For example, this file represents two rows of data with four columns "a", "b", "c", "d":

{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}

The features currently offered are the following:

  • multi-threaded or single-threaded reading
  • automatic decompression of input files (based on the filename extension, such as my_data.json.gz)
  • automatic type inference from JSON values, including nested lists and structs (see below)

Note

Currently only the line-delimited JSON format is supported.

Usage

JSON reading functionality is available through the :mod:`pyarrow.json` module. In many cases, you will simply call the :func:`read_json` function with the file path you want to read from:

>>> from pyarrow import json
>>> fn = 'my_data.json'
>>> table = json.read_json(fn)
>>> table
pyarrow.Table
a: int64
b: double
c: string
d: bool
>>> table.to_pandas()
   a    b     c      d
0  1  2.0   foo  False
1  4 -5.5  None   True

Automatic type inference

Arrow :ref:`data types <data.types>` are inferred from the JSON types and values of each column:

  • JSON null values convert to the null type. A column that also contains non-null values takes the type inferred from those values instead.
  • JSON booleans convert to bool_.
  • JSON numbers convert to int64, falling back to float64 if a non-integer is encountered.
  • JSON strings of the form "YYYY-MM-DD" or "YYYY-MM-DD hh:mm:ss" convert to timestamp[s], falling back to utf8 if a conversion error occurs.
  • JSON arrays convert to a list type, and inference proceeds recursively on the JSON arrays' values.
  • Nested JSON objects convert to a struct type, and inference proceeds recursively on the JSON objects' values.

Thus, reading this JSON file:

{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

returns the following data:

>>> table = json.read_json("my_data.json")
>>> table
pyarrow.Table
a: list<item: int64>
  child 0, item: int64
b: struct<c: bool, d: timestamp[s]>
  child 0, c: bool
  child 1, d: timestamp[s]
>>> table.to_pandas()
           a                                       b
0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}

Customized parsing

To override the default parsing settings, such as when reading JSON files with an unusual structure, create a :class:`ParseOptions` instance and pass it to :func:`read_json`. For example, you can set :attr:`ParseOptions.explicit_schema` to an explicit :ref:`schema <data.schema>` in order to bypass automatic type inference.

Similarly, you can tune performance settings, such as multi-threading and block size, by passing a :class:`ReadOptions` instance to :func:`read_json`.

Incremental reading

For memory-constrained environments, it is also possible to read a JSON file one batch at a time, using :func:`open_json`.

In this case, type inference is done on the first block and types are frozen afterwards. To make sure the right data types are inferred, either set :attr:`ReadOptions.block_size` to a large enough value, or use :attr:`ParseOptions.explicit_schema` to set the desired data types explicitly.