Split a listings JSON file
Utility to split a JSON file into shards such that no shard is too big or has too many keys. Here, “too big” and “too many keys” are parameterized; see the command-line flags below.
Example Usage:
# Install dependencies
pip install fire
# json is part of the Python standard library; no install is needed
pip install ijson
# Specify paths of input and output files
INPUT_PATH="/tmp/input.json"
OUTPUT_PATH_PREFIX="/tmp/output_"
MAX_BYTES_PER_SHARD=100000
MAX_KEYS_PER_SHARD=100
# Run the script
python split_listing_batch.py \
--input_path $INPUT_PATH \
--output_path_prefix $OUTPUT_PATH_PREFIX \
--max_bytes_per_shard $MAX_BYTES_PER_SHARD \
--max_keys_per_shard $MAX_KEYS_PER_SHARD
# Expect: files with names starting with '/tmp/output_' such that
# no shard has more than 100000 bytes or 100 listings.
# See `split_json` function definition below for detailed documentation.
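One way to confirm the expected outcome after a run is a small check over the written shards. The sketch below is not part of the utility: `check_shards` is a hypothetical helper, and it assumes each shard is a JSON object whose top-level keys are the listings.

```python
import glob
import json
import os


def check_shards(output_path_prefix, max_bytes_per_shard, max_keys_per_shard):
    """Return (path, num_bytes, num_keys) for every shard that exceeds a limit."""
    violations = []
    # Match every file whose name starts with the prefix, e.g. /tmp/output_*
    for path in sorted(glob.glob(glob.escape(output_path_prefix) + "*")):
        num_bytes = os.path.getsize(path)
        with open(path, "r", encoding="utf-8") as f:
            # Assumption: the top level of each shard is a JSON object.
            num_keys = len(json.load(f))
        if num_bytes > max_bytes_per_shard or num_keys > max_keys_per_shard:
            violations.append((path, num_bytes, num_keys))
    return violations
```

An empty list from `check_shards("/tmp/output_", 100000, 100)` would mean every shard respects both limits. A non-empty list is not necessarily a bug, since a single oversized listing can push a shard over the byte limit.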
- tonita.split_listing_batch.split_json(input_path=None, output_path_prefix=None, max_keys_per_shard=10, max_bytes_per_shard=100000)
Splits a batch of listings into shards, each limited to a maximum number of keys and a maximum size in bytes.
- Parameters:
input_path (str) – The path to the input batch JSON file.
output_path_prefix (str) – The prefix for output filenames. If the output files exist, they will be overwritten.
max_keys_per_shard (int) – The maximum number of listings per output file.
max_bytes_per_shard (int) – The maximum number of bytes per output file. Note that this limit cannot be guaranteed when a single key-value pair is itself larger than this value.
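To illustrate how the two limits interact, here is a minimal sketch of a greedy grouping strategy. It is an assumption about one reasonable approach, not the actual body of `split_json`: the helper names `pair_size` and `split_pairs` are hypothetical, sizes are measured against default `json.dumps` output, and the pairs are taken from an in-memory iterable rather than streamed from a file.

```python
import json


def pair_size(key, value):
    """Bytes used by one '"key": value' entry in json.dumps output, braces excluded."""
    return len(json.dumps({key: value}).encode("utf-8")) - 2


def split_pairs(kv_pairs, max_keys_per_shard=10, max_bytes_per_shard=100000):
    """Greedily group (key, value) pairs into shards bounded by count and size."""
    shards = []
    current = []
    current_bytes = 2  # the enclosing "{}"
    for key, value in kv_pairs:
        # Every pair after the first also costs 2 bytes for the ", " separator.
        cost = pair_size(key, value) + (2 if current else 0)
        too_many_keys = len(current) >= max_keys_per_shard
        too_big = current_bytes + cost > max_bytes_per_shard
        # Close the current shard only if it already holds something, so an
        # oversized pair still lands in a shard of its own.
        if current and (too_many_keys or too_big):
            shards.append(current)
            current = []
            current_bytes = 2
            cost = pair_size(key, value)
        current.append((key, value))
        current_bytes += cost
    if current:
        shards.append(current)
    return shards
```

In this sketch a pair larger than `max_bytes_per_shard` is placed alone in its own shard, which is why the byte limit cannot always be honored.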
- tonita.split_listing_batch.write_file(path_prefix, file_index, kv_pairs)
Writes key-value pairs into an output file in JSON format.
- Parameters:
path_prefix (str) – The prefix for output filenames.
file_index (int) – An integer that is appended to path_prefix to construct the final name of the output file.
kv_pairs (List[str, Any]) – The key-value pairs that will be written into the output file.
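As a rough illustration of the behavior described above, the following sketch writes one shard. It is not the library's implementation: `write_file_sketch` is a hypothetical name, `kv_pairs` is assumed to be a list of `(key, value)` tuples, and the file name is assumed to be the prefix followed directly by the index, with no separator or extension.

```python
import json


def write_file_sketch(path_prefix, file_index, kv_pairs):
    """Write key-value pairs as one JSON object and return the path written."""
    # Assumption: the index is appended directly to the prefix.
    path = f"{path_prefix}{file_index}"
    # Mode "w" overwrites any existing file at the same path.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dict(kv_pairs), f)
    return path
```

With the prefix `/tmp/output_` and index `3`, this sketch would write to `/tmp/output_3`.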