.. currentmodule:: pyarrow.json

Reading JSON files
==================

Arrow supports reading columnar data from line-delimited JSON files. In this context, a JSON file consists of multiple JSON objects, one per line, representing individual data rows. For example, this file represents two rows of data with four columns "a", "b", "c", "d":

{"a": 1, "b": 2.0, "c": "foo", "d": false}
{"a": 4, "b": -5.5, "c": null, "d": true}

The features currently offered are the following:

* multi-threaded or single-threaded reading
* automatic decompression of input files (based on the filename extension,
  such as ``my_data.json.gz``)
* sophisticated type inference (see below)

.. note::
   Currently only the line-delimited JSON format is supported.

Usage
-----

JSON reading functionality is available through the :mod:`pyarrow.json` module. In many cases, you will simply call the :func:`read_json` function with the file path you want to read from:

>>> from pyarrow import json
>>> fn = 'my_data.json'
>>> table = json.read_json(fn)
>>> table
pyarrow.Table
a: int64
b: double
c: string
d: bool
>>> table.to_pandas()
   a    b     c      d
0  1  2.0   foo  False
1  4 -5.5  None   True

Automatic Type Inference
------------------------

Arrow :ref:`data types <data.types>` are inferred from the JSON types and values of each column:

* JSON null values convert to the ``null`` type, but can fall back to any
  other type.
* JSON booleans convert to ``bool_``.
* JSON numbers convert to ``int64``, falling back to ``float64`` if a
  non-integer is encountered.
* JSON strings of the kind "YYYY-MM-DD" and "YYYY-MM-DD hh:mm:ss" convert
  to ``timestamp[s]``, falling back to ``utf8`` if a conversion error occurs.
* JSON arrays convert to a ``list`` type, and inference proceeds recursively
  on the JSON arrays' values.
* Nested JSON objects convert to a ``struct`` type, and inference proceeds
  recursively on the JSON objects' values.

Thus, reading this JSON file:

{"a": [1, 2], "b": {"c": true, "d": "1991-02-03"}}
{"a": [3, 4, 5], "b": {"c": false, "d": "2019-04-01"}}

returns the following data:

>>> table = json.read_json("my_data.json")
>>> table
pyarrow.Table
a: list<item: int64>
  child 0, item: int64
b: struct<c: bool, d: timestamp[s]>
  child 0, c: bool
  child 1, d: timestamp[s]
>>> table.to_pandas()
           a                                       b
0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}

Customized parsing
------------------

To change the default parsing settings, for example when reading JSON files with an unusual structure, create a :class:`ParseOptions` instance and pass it to :func:`read_json`. For instance, you can pass an explicit :ref:`schema <data.schema>` to bypass automatic type inference.

Similarly, you can choose performance settings by passing a :class:`ReadOptions` instance to :func:`read_json`.

Incremental reading
-------------------

For memory-constrained environments, it is also possible to read a JSON file one batch at a time, using :func:`open_json`.

In this case, type inference is done on the first block and types are frozen afterwards. To make sure the right data types are inferred, either set :attr:`ReadOptions.block_size` to a large enough value, or use :attr:`ParseOptions.explicit_schema` to set the desired data types explicitly.
