Managing corpora

This section walks through how to work with corpora.

Adding a corpus

When we add listings, we add them to a particular corpus. When we perform search, we perform it for the listings in a particular corpus. Therefore, the very first step to getting started with Tonita is creating a corpus.

Doing so is straightforward: we simply call tonita.corpora.add(), providing the ID of the corpus we want to create. If the creation was successful, you will receive an AddCorpusResponse.

tonita.corpora.add(corpus_id="my_first_corpus")

# Example return value:
# AddCorpusResponse(corpus_id="my_first_corpus")

Attention

The corpus ID can contain only alphanumeric characters and underscores.

This corpus is now ready to use!

Note

Corpus IDs must be unique. For the same API key, two corpora with the same ID cannot exist simultaneously. If a corpus is added with the same ID as an existing one, an error will be raised.
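The character rule for corpus IDs can be checked locally before calling tonita.corpora.add(). The following is a minimal sketch, not part of the Tonita library: the helper name is_valid_corpus_id is hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and an empty ID is assumed to be invalid.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
# Assumption: an empty string is not a valid corpus ID.
_CORPUS_ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_corpus_id(corpus_id: str) -> bool:
    """Return True if corpus_id contains only letters, digits, and underscores.

    Hypothetical client-side helper; the server remains the authority.
    """
    return _CORPUS_ID_PATTERN.fullmatch(corpus_id) is not None
```

A check like this only catches malformed IDs early. It does not detect an ID that is already in use, which still results in an error from the service.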

The state of a corpus

Every corpus has a state, which can either be active or inactive:

  1. If a corpus is active, that means that it can be used: listings can be added to it and search can be performed over its listings. When a corpus is first created, its state is automatically set to active.

  2. Deleting a corpus sets its state to inactive. You cannot add listings to an inactive corpus, nor can you search over its listings. The only operation you may perform on an inactive corpus is recovering it, which sets its state back to active. Note that after seven (7) days of continuous inactivity, an inactive corpus “expires”, at which point it becomes unrecoverable.
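The lifecycle described above can be pictured as a small state machine. The sketch below is a toy local model for illustration only; it is not how Tonita implements corpora. The names ToyCorpus and RECOVERY_WINDOW_SECONDS are hypothetical, and the behavior of deleting an already inactive corpus is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

RECOVERY_WINDOW_SECONDS = 7 * 24 * 60 * 60  # seven days, in seconds


class State(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"


@dataclass
class ToyCorpus:
    """Hypothetical local model of a corpus; not part of the Tonita library."""

    state: State = State.ACTIVE             # a new corpus starts out active
    inactive_since: Optional[float] = None  # time of deletion, in seconds

    def delete(self, now: float) -> None:
        # Assumption: in this toy model only an active corpus can be deleted.
        if self.state is not State.ACTIVE:
            raise ValueError("corpus is already inactive")
        self.state = State.INACTIVE
        self.inactive_since = now

    def is_expired(self, now: float) -> bool:
        # A corpus expires after seven days of continuous inactivity.
        if self.inactive_since is None:
            return False
        return now - self.inactive_since >= RECOVERY_WINDOW_SECONDS

    def recover(self, now: float) -> None:
        if self.state is not State.INACTIVE:
            raise ValueError("only an inactive corpus can be recovered")
        if self.is_expired(now):
            raise ValueError("corpus has expired and can no longer be recovered")
        self.state = State.ACTIVE
        self.inactive_since = None  # recovering resets the inactivity clock
```

In this model, recovering a corpus clears inactive_since, which mirrors the requirement that the seven days of inactivity be continuous.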

Listing corpora

To see all corpora associated with a given API key, call tonita.corpora.list():

tonita.corpora.list()

This will list all corpora, along with their respective states:

# Example return value:
# ListCorporaResponse(
#     results: {
#         "my_first_corpus": <State.ACTIVE: 'ACTIVE'>,
#         "my_second_corpus": <State.INACTIVE: 'INACTIVE'>
#     }
# )
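Since the listing maps each corpus ID to its state, it is easy to pick out only the corpora that are currently usable. The following is a sketch under assumptions: the State class here is a local stand-in modeled on the example output, and the helper active_corpus_ids is hypothetical. It relies only on each state having a value of 'ACTIVE' or 'INACTIVE', as the example output suggests.

```python
from enum import Enum
from typing import Dict, List


class State(Enum):
    """Local stand-in for the State enum shown in the example output."""

    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"


def active_corpus_ids(results: Dict[str, Enum]) -> List[str]:
    """Return the IDs of all active corpora, sorted alphabetically.

    Hypothetical helper; `results` is a mapping from corpus ID to state.
    """
    return sorted(
        corpus_id
        for corpus_id, state in results.items()
        if state.value == "ACTIVE"
    )
```

If the response exposes the mapping as an attribute named results (an assumption based on the printed example), the call would look like active_corpus_ids(tonita.corpora.list().results).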

Getting information about a corpus

Calling tonita.corpora.get() will return more detailed information about a given corpus:

tonita.corpora.get(corpus_id="my_corpus_id")

# Example return value:
# CorpusGetResponse(
#     corpus_id="my_corpus_id",
#     exists=True,
#     state=<State.INACTIVE: 'INACTIVE'>,
#     seconds_to_expiration=1.72
# )

Its return value, CorpusGetResponse, contains the following fields:

  corpus_id: The ID of the corpus.

  exists: A boolean indicating whether the corpus exists.

  state: The state of the corpus.

  seconds_to_expiration: If the corpus is inactive, the time (in seconds) before it becomes unrecoverable.

In the example above, the corpus exists but is inactive, and therefore has a value for seconds_to_expiration. An active corpus will have seconds_to_expiration=None. A corpus that does not exist will not only have exists=False, but both state and seconds_to_expiration will be None.
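The three cases just described (nonexistent, active, inactive) can be told apart from exists and seconds_to_expiration alone. The sketch below illustrates this with two hypothetical helpers, summarize_corpus and time_to_expiration, which take plain values rather than a response object so that they make no assumptions about the library's classes.

```python
from datetime import timedelta
from typing import Optional


def time_to_expiration(seconds_to_expiration: float) -> timedelta:
    """Convert the remaining seconds into a timedelta for easier reading."""
    return timedelta(seconds=seconds_to_expiration)


def summarize_corpus(exists: bool, seconds_to_expiration: Optional[float]) -> str:
    """Describe a corpus using the field semantics explained above.

    Hypothetical helper, not part of the Tonita library.
    """
    if not exists:
        # Nonexistent corpora have state=None and seconds_to_expiration=None.
        return "does not exist"
    if seconds_to_expiration is None:
        # An existing corpus without an expiration countdown is active.
        return "active"
    return f"inactive, recoverable for {seconds_to_expiration:.2f} more seconds"
```

Assuming the response fields are accessible as attributes, a call might look like summarize_corpus(response.exists, response.seconds_to_expiration). For a freshly deleted corpus, time_to_expiration() returns a value just under timedelta(days=7).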

Deleting and recovering corpora

Deleting a corpus is straightforward:

tonita.corpora.delete(corpus_id="my_corpus_id")

If the deletion was successful, you will receive a DeleteCorpusResponse:

# Example return value:
# DeleteCorpusResponse(corpus_id="my_corpus_id")

Strictly speaking, however, calling tonita.corpora.delete() only schedules the corpus for deletion by making the corpus inactive. Inactive corpora are recoverable for seven (7) days after they first become inactive.

To recover a corpus (i.e., make an inactive corpus active), call tonita.corpora.recover():

tonita.corpora.recover(corpus_id="my_corpus_id")

If the recovery was successful, you will receive a RecoverCorpusResponse:

# Example return value:
# RecoverCorpusResponse(corpus_id="my_corpus_id")

If a corpus is continuously inactive for seven (7) days, however, it cannot be recovered.

For example, suppose we have the following corpora:

tonita.corpora.list()

# Example return value:
# ListCorporaResponse(
#     results: {
#         "my_first_corpus": <State.ACTIVE: 'ACTIVE'>,
#         "my_second_corpus": <State.ACTIVE: 'ACTIVE'>
#     }
# )

We plan to delete the corpus named “my_first_corpus”. However, before doing so, let’s take a look at its information:

tonita.corpora.get(corpus_id="my_first_corpus")

# Example return value:
# CorpusGetResponse(
#     corpus_id="my_first_corpus",
#     exists=True,
#     state=<State.ACTIVE: 'ACTIVE'>,
#     seconds_to_expiration=None
# )

Now, let’s delete it, and look at its information again:

tonita.corpora.delete(corpus_id="my_first_corpus")

# Example return value:
# DeleteCorpusResponse(corpus_id="my_first_corpus")

tonita.corpora.get(corpus_id="my_first_corpus")

# Example return value:
# CorpusGetResponse(
#     corpus_id="my_first_corpus",
#     exists=True,
#     state=<State.INACTIVE: 'INACTIVE'>,
#     seconds_to_expiration=604783.23
# )

Note that it is now inactive. Further, it now has a value for the seconds_to_expiration field, indicating that it will expire and become unrecoverable in about seven (7) days.

Before this happens, however, we can recover it:

tonita.corpora.recover(corpus_id="my_first_corpus")

# Example return value:
# RecoverCorpusResponse(corpus_id="my_first_corpus")

tonita.corpora.get(corpus_id="my_first_corpus")

# Example return value:
# CorpusGetResponse(
#     corpus_id="my_first_corpus",
#     exists=True,
#     state=<State.ACTIVE: 'ACTIVE'>,
#     seconds_to_expiration=None
# )

Now suppose instead that the corpus had been left inactive for more than seven (7) days and has already expired:

tonita.corpora.get(corpus_id="my_first_corpus")

# Example return value:
# CorpusGetResponse(
#     corpus_id="my_first_corpus",
#     exists=False,
#     state=None,
#     seconds_to_expiration=None
# )

Trying to recover this corpus now will raise a TonitaBadRequestError since the corpus no longer exists.
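One way to avoid this error is to check the corpus with get() before attempting recovery. The following is a sketch only: recover_if_possible is a hypothetical helper, and its corpora argument is assumed to be any object that exposes get() and recover() with a corpus_id keyword and a response with exists and seconds_to_expiration attributes (for example, tonita.corpora, if it behaves as shown in this section).

```python
from typing import Any


def recover_if_possible(corpora: Any, corpus_id: str) -> bool:
    """Recover corpus_id if it is inactive; return True if it ends up active.

    Hypothetical helper. `corpora` is assumed to provide get() and recover()
    as described in this section.
    """
    info = corpora.get(corpus_id=corpus_id)
    if not info.exists:
        # Expired or never created: there is nothing left to recover.
        return False
    if info.seconds_to_expiration is None:
        # No expiration countdown means the corpus is already active.
        return True
    corpora.recover(corpus_id=corpus_id)
    return True
```

Note that a corpus could still expire between the two calls, so code that uses this pattern close to the seven-day limit should still be prepared for recover() to raise.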

Attention

It might take some time for an expired corpus and its data to be removed completely from our databases. Therefore, the ID of a recently expired corpus may not be immediately available for reuse.