Split a listings JSON file
Utility to split a JSON file into shards such that no shard is too big or has too many keys. Here, “too big” and “too many keys” are parameterized; see the command-line flags below.
Example Usage:
# Specify input and output paths, as well as byte and key limits.
INPUT_PATH="/tmp/input.json"
OUTPUT_PATH_PREFIX="/tmp/output"
MAX_BYTES_PER_SHARD=100000
MAX_KEYS_PER_SHARD=100
# Run the splitter with the settings above.
python split_listing_batch.py \
  --input_path "$INPUT_PATH" \
  --output_path_prefix "$OUTPUT_PATH_PREFIX" \
  --max_keys_per_shard "$MAX_KEYS_PER_SHARD" \
  --max_bytes_per_shard "$MAX_BYTES_PER_SHARD"
The output of this command will be a set of files in the ‘/tmp/’ directory where each filename is prefixed by “output_” followed by its index (e.g., “/tmp/output_0.json”).
The JSON in each file will contain no more than the specified number of listings (the value passed to --max_keys_per_shard) and be no larger than the specified number of bytes (the value passed to --max_bytes_per_shard).
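The two limits can be illustrated with a minimal greedy sketch. It is not taken from the script's source: the function name `shard_dict` is hypothetical, and it assumes the input is a single top-level JSON object whose keys identify listings and that shard size is measured on the UTF-8 encoding of `json.dumps` output.

```python
import json


def shard_dict(items, max_keys, max_bytes):
    """Greedily group key-value pairs into shards within both limits.

    Hypothetical helper for illustration only. A pair that is larger than
    max_bytes on its own is still emitted, alone, in an oversized shard.
    """
    shards = []
    current = {}
    for key, value in items.items():
        # Tentatively add the pair and measure the resulting shard.
        candidate = dict(current)
        candidate[key] = value
        too_many = len(candidate) > max_keys
        too_big = len(json.dumps(candidate).encode("utf-8")) > max_bytes
        if current and (too_many or too_big):
            # Close the current shard and start a new one with this pair.
            shards.append(current)
            current = {key: value}
        else:
            current = candidate
    if current:
        shards.append(current)
    return shards


if __name__ == "__main__":
    listings = {f"id{i}": {"n": i} for i in range(5)}
    for index, shard in enumerate(shard_dict(listings, 2, 10_000)):
        print(index, list(shard))
```

Re-serializing the candidate shard for every pair keeps the sketch short at the cost of quadratic work; a real splitter would more likely track a running byte count.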
- tonita.split_listing_batch.split_json(input_path=None, output_path_prefix=None, max_keys_per_shard=10, max_bytes_per_shard=100000)
Split a batch of listings into shards, each within a fixed maximum number of keys and bytes.
- Parameters:
input_path (str) – The path to the input batch JSON file.
output_path_prefix (str) – The prefix for output filenames. If the output files exist, they will be overwritten.
max_keys_per_shard (int) – The maximum number of listings per output file.
max_bytes_per_shard (int) – The maximum number of bytes per output file. Note that this limit cannot be guaranteed if any single key-value pair is larger than this value.
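Because the byte limit is not guaranteed, it can be useful to check the shards after a run. The following is a minimal sketch, not part of the library: `check_shards` is a hypothetical helper, and it assumes the shard files follow the `<prefix>_<index>.json` naming shown in the example above and that each holds a single JSON object.

```python
import glob
import json
import os


def check_shards(output_path_prefix, max_keys_per_shard, max_bytes_per_shard):
    """Return (path, num_keys, num_bytes) for each shard exceeding a limit.

    Hypothetical helper; assumes shards are named <prefix>_<index>.json.
    """
    violations = []
    pattern = glob.escape(output_path_prefix) + "_*.json"
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            num_keys = len(json.load(f))
        num_bytes = os.path.getsize(path)
        if num_keys > max_keys_per_shard or num_bytes > max_bytes_per_shard:
            violations.append((path, num_keys, num_bytes))
    return violations


if __name__ == "__main__":
    # Limits matching the command-line example above.
    for path, num_keys, num_bytes in check_shards("/tmp/output", 100, 100000):
        print(f"{path}: {num_keys} keys, {num_bytes} bytes")
```

An empty result means every shard found is within both limits; any entry reported is most likely a shard holding a key-value pair that exceeds the byte limit on its own.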