Split a listings JSON file

Utility to split a JSON file into shards such that no shard is too big or has too many keys. Both “too big” and “too many keys” are parameterized; see the command-line flags below.

Example Usage:

# Install dependencies (the json module is part of the Python standard library)
pip install fire
pip install ijson

# Specify paths of input and output files
INPUT_PATH="/tmp/input.json"
OUTPUT_PATH_PREFIX="/tmp/output_"
MAX_BYTES_PER_SHARD=100000
MAX_KEYS_PER_SHARD=100

# Run the script
python split_listing_batch.py \
    --input_path "$INPUT_PATH" \
    --output_path_prefix "$OUTPUT_PATH_PREFIX" \
    --max_bytes_per_shard "$MAX_BYTES_PER_SHARD" \
    --max_keys_per_shard "$MAX_KEYS_PER_SHARD"

# Expect: files with names starting with '/tmp/output_' such that
# no shard has more than 100000 bytes or 100 listings.

# See `split_json` function definition below for detailed documentation.
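One way to confirm the expectation above after a run is to scan the output files and report any that exceed either limit. The following is a minimal sketch, not part of the utility: the helper name `find_oversized_shards` is hypothetical, and it assumes each shard is a JSON object at the top level and that every file whose name starts with the prefix is a shard.

```python
import glob
import json
import os


def find_oversized_shards(output_path_prefix, max_bytes_per_shard, max_keys_per_shard):
    """Return (path, size_in_bytes, num_keys) for every shard that breaks a limit."""
    violations = []
    # Assumption: all files matching "<prefix>*" were produced by the splitter.
    for path in sorted(glob.glob(glob.escape(output_path_prefix) + "*")):
        size = os.path.getsize(path)
        with open(path, encoding="utf-8") as f:
            num_keys = len(json.load(f))
        if size > max_bytes_per_shard or num_keys > max_keys_per_shard:
            violations.append((path, size, num_keys))
    return violations
```

With the values from the example, `find_oversized_shards("/tmp/output_", 100000, 100)` would be expected to return an empty list unless a single listing is itself larger than the byte limit.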
tonita.split_listing_batch.split_json(input_path=None, output_path_prefix=None, max_keys_per_shard=10, max_bytes_per_shard=100000)

Split a batch of listings into shards, each limited to a maximum number of keys and a maximum size in bytes.

Parameters:
  • input_path (str) – The path to the input batch JSON file.

  • output_path_prefix (str) – The prefix for output filenames. If the output files exist, they will be overwritten.

  • max_keys_per_shard (int) – The maximum number of listings per output file.

  • max_bytes_per_shard (int) – The maximum number of bytes per output file. This limit is not guaranteed when a single key-value pair is itself larger than this value.
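The interaction between the two limits can be illustrated with a greedy pass over the key-value pairs. This is a minimal sketch under stated assumptions, not the actual implementation of `split_json`: the helper name `shard_pairs` is hypothetical, the input is taken to be any iterable of pairs (how the real function reads the file is not shown here), and sizes are estimated for output written by `json.dumps` with its default separators.

```python
import json
from typing import Any, Iterable, Iterator, List, Tuple


def shard_pairs(
    pairs: Iterable[Tuple[str, Any]],
    max_keys_per_shard: int = 10,
    max_bytes_per_shard: int = 100000,
) -> Iterator[List[Tuple[str, Any]]]:
    """Greedily group pairs so each shard respects both limits where possible."""
    shard: List[Tuple[str, Any]] = []
    shard_bytes = 2  # The enclosing "{" and "}" of the serialized object.
    for key, value in pairs:
        # Size of '"key": value' alone (default json.dumps output is ASCII).
        item_bytes = len(json.dumps({key: value})) - 2
        # Joining a non-empty shard also costs a ", " separator.
        added = item_bytes + (2 if shard else 0)
        too_many_keys = len(shard) >= max_keys_per_shard
        too_many_bytes = shard_bytes + added > max_bytes_per_shard
        if shard and (too_many_keys or too_many_bytes):
            yield shard
            shard, shard_bytes = [], 2
            added = item_bytes
        # A pair larger than the byte limit still gets a shard of its own,
        # which is why the byte limit cannot always be guaranteed.
        shard.append((key, value))
        shard_bytes += added
    if shard:
        yield shard
```

For example, `list(shard_pairs([("a", 1), ("b", 2), ("c", 3)], max_keys_per_shard=2))` groups the first two pairs together and places the third in a second shard.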

tonita.split_listing_batch.write_file(path_prefix, file_index, kv_pairs)

Writes key-value pairs into an output file in JSON format.

Parameters:
  • path_prefix (str) – The prefix for output filenames.

  • file_index (int) – An integer that is appended to path_prefix to construct the final name of the output file.

  • kv_pairs (List[str, Any]) – The key-value pairs that will be written into the output file.
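A function with this shape could look roughly like the sketch below. It is illustrative only: the name `write_shard` is hypothetical, and the exact filename format (here the prefix followed directly by the index, with no extension or zero padding) is an assumption, since the documentation only states that the index is appended to the prefix.

```python
import json
from typing import Any, Iterable, Tuple


def write_shard(path_prefix: str, file_index: int, kv_pairs: Iterable[Tuple[str, Any]]) -> str:
    """Write key-value pairs to '<path_prefix><file_index>' as one JSON object."""
    # Assumption: the output name is simply the prefix plus the index.
    path = f"{path_prefix}{file_index}"
    # Mode "w" overwrites any existing file at this path.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dict(kv_pairs), f)
    return path
```

Calling `write_shard("/tmp/output_", 0, [("listing_1", {"title": "example"})])` would write a single-key JSON object to `/tmp/output_0`.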