Split a listings JSON file

Utility to split a JSON file into shards such that no shard is too big or has too many keys. Here, “too big” and “too many keys” are parameterized; see the command-line flags below.

Example Usage:

# Specify input and output paths, as well as byte and key limits.
INPUT_PATH="/tmp/input.json"
OUTPUT_PATH_PREFIX="/tmp/output"
MAX_BYTES_PER_SHARD=100000
MAX_KEYS_PER_SHARD=100

# Run the splitter with the values defined above.
python split_listing_batch.py \
    --input_path "$INPUT_PATH" \
    --output_path_prefix "$OUTPUT_PATH_PREFIX" \
    --max_keys_per_shard "$MAX_KEYS_PER_SHARD" \
    --max_bytes_per_shard "$MAX_BYTES_PER_SHARD"

This command writes a set of files to the ‘/tmp/’ directory. Each filename consists of the prefix “output_”, the shard index, and the “.json” extension (e.g., “/tmp/output_0.json”).

The JSON in each file will contain no more than the specified number of listings (the value passed to --max_keys_per_shard) and be no larger than the specified number of bytes (the value passed to --max_bytes_per_shard).
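To make the two limits concrete, here is a minimal sketch of one way such a split could work. It is not the tonita implementation: shard_items and write_shards are hypothetical helpers, and the sketch assumes the input is a single JSON object whose top-level keys are the listings and that shard size is measured on the default json.dumps serialization.

```python
import json


def shard_items(data, max_keys, max_bytes):
    """Greedily pack the key-value pairs of `data` into shards.

    Hypothetical helper for illustration; not the tonita implementation.
    Assumes `data` is a dict and `max_keys` is at least 1.
    """
    shards = []
    current = {}
    for key, value in data.items():
        candidate = {**current, key: value}
        too_many_keys = len(candidate) > max_keys
        too_many_bytes = len(json.dumps(candidate).encode("utf-8")) > max_bytes
        # Close the current shard when adding this pair would break a limit.
        # If the current shard is empty, the pair is kept anyway, so a single
        # oversized pair ends up alone in a shard that exceeds max_bytes.
        if current and (too_many_keys or too_many_bytes):
            shards.append(current)
            current = {key: value}
        else:
            current = candidate
    if current:
        shards.append(current)
    return shards


def write_shards(shards, output_path_prefix):
    """Write each shard to '<prefix>_<index>.json' and return the paths."""
    paths = []
    for index, shard in enumerate(shards):
        path = f"{output_path_prefix}_{index}.json"
        with open(path, "w", encoding="utf-8") as f:
            json.dump(shard, f)
        paths.append(path)
    return paths
```

The sketch re-serializes the candidate shard for every pair, which keeps the logic easy to follow but is quadratic within a shard; a real implementation would more likely track a running byte count.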

tonita.split_listing_batch.split_json(input_path=None, output_path_prefix=None, max_keys_per_shard=10, max_bytes_per_shard=100000)

Split a batch of listings into shards, each limited in the number of keys and the number of bytes it contains.

Parameters:
  • input_path (str) – The path to the input batch JSON file.

  • output_path_prefix (str) – The prefix for output filenames. If the output files exist, they will be overwritten.

  • max_keys_per_shard (int) – The maximum number of listings per output file.

  • max_bytes_per_shard (int) – The maximum number of bytes per output file. Note that this limit cannot be guaranteed when a single key-value pair is itself larger than this value.
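Because the byte limit is not guaranteed when a single key-value pair is oversized, it can be useful to check the output files after a split. The following is a sketch only: check_shards is a hypothetical helper that is not part of tonita, and it assumes the “<prefix>_<index>.json” naming shown in the example above and that each shard is a JSON object keyed by listing.

```python
import glob
import json
import os


def check_shards(output_path_prefix, max_keys_per_shard, max_bytes_per_shard):
    """Return (path, problem) tuples for shard files that exceed a limit.

    Hypothetical helper for illustration; not part of the tonita package.
    """
    problems = []
    # glob.escape protects any wildcard characters that occur in the prefix.
    pattern = glob.escape(output_path_prefix) + "_*.json"
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            shard = json.load(f)
        if len(shard) > max_keys_per_shard:
            problems.append((path, "too many keys"))
        if os.path.getsize(path) > max_bytes_per_shard:
            problems.append((path, "too many bytes"))
    return problems
```

An empty return value means every shard file found under the prefix respects both limits.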