
Harvest O&M and Debugging

James Brown edited this page Sep 15, 2025 · 4 revisions

Harvest O&M

To manage data.gov's harvest system properly, we maintain the following procedures, tools, guidelines, and cadences. This is meant to be a "living" document: edit it as new discoveries or updates are made.

Error Handling

Most reported errors (validation, transformation, source/job errors, etc.) are "expected" errors. If there are recurring issues getting data from a source, or most of a source is invalid, this should be raised with the data provider via the team lead.

However, some jobs fail due to long-running infrastructure reasons. These are highlighted at the bottom of the metrics page (prod link to be created).

These jobs should be examined more carefully. It is possible (actually fairly likely) that the process was killed while waiting for a response from CKAN, in which case we don't know whether that load succeeded. To verify:

  • Go to the cloud.gov log dashboard like this. Zoom in to the right day, and make sure to update the query with the job ID. This will give you the time frame in which the job was running. You can try to use the start and end times reported on the job, but the reported end time might be later (in some cases much later) than the point at which the task actually stopped doing work.
  • Use this CKAN API call, updating the harvest_source_title and the metadata created and modified dates for the job you are looking for. Unfortunately, CKAN doesn't keep track of the job ID, so we have to match on harvest source and timing. If the counts for the job differ (either more or fewer records on CKAN than in the harvester), the sync job should be run immediately (details TBD), since there are likely records that need to be updated.
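The count comparison above could be sketched roughly as follows. This is only an illustration under several assumptions: the endpoint (`catalog.data.gov` with CKAN's `package_search` action), the use of an `fq` filter on `harvest_source_title` and `metadata_modified`, and the helper names `build_ckan_count_url` and `count_discrepancy` are not taken from this page, and the actual API call linked above may differ.

```python
from urllib.parse import urlencode

# Assumed endpoint: CKAN's package_search action on the data.gov catalog.
CKAN_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_ckan_count_url(source_title, start, end):
    """Build a search URL for datasets from one harvest source that were
    modified inside the job's time window (hypothetical helper)."""
    # Assumption: harvest_source_title and metadata_modified are filterable
    # fields, and timestamps use the Solr form, e.g. 2025-08-21T22:00:00Z.
    fq = (
        f'harvest_source_title:"{source_title}" '
        f"AND metadata_modified:[{start} TO {end}]"
    )
    # rows=0 asks for the match count only, without the dataset bodies.
    return f"{CKAN_SEARCH_URL}?{urlencode({'fq': fq, 'rows': 0})}"


def count_discrepancy(harvester_count, ckan_count):
    """Return CKAN count minus harvester count; non-zero suggests a sync is needed."""
    return ckan_count - harvester_count


url = build_ckan_count_url(
    "Example Source", "2025-08-21T22:00:00Z", "2025-08-21T23:59:59Z"
)
print(url)
```

The time window would come from the log dashboard step, and the harvester-side count from the job record. Any non-zero result from `count_discrepancy` is the signal to run the sync job.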

After any sync-up process is complete, evaluate whether a re-harvest needs to be kicked off (is the next scheduled harvest far out or nonexistent? How many changes are there normally?).

Harvest API

The Harvest API gives us a lot of power to be able to find, classify, and debug various issues. The routes are created here, but can be a bit confusing to parse. Each of our 6 object types (organizations, harvest_sources, harvest_jobs, harvest_records, harvest_job_errors, and harvest_record_errors) is queryable. The following simple options are available:

  • page: the page number you want to extract
  • per_page: the number of records per page you want to pull
  • paginate: whether to paginate the results; defaults to true
  • count: return only the count of results, not the results themselves
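As a rough sketch of how these simple options combine into a request URL, the helper below builds query strings for any of the object types. The host comes from the example URLs on this page; the helper name `build_query_url` and the lowercase rendering of boolean values are assumptions, not something the API is confirmed to require.

```python
from urllib.parse import urlencode

# Host taken from the example URLs elsewhere on this page.
BASE_URL = "https://harvest.data.gov"


def build_query_url(object_type, **params):
    """Build a Harvest API URL for one object type (hypothetical helper)."""
    # Leave out unset options so the API falls back to its own defaults.
    query = {key: value for key, value in params.items() if value is not None}
    # Assumption: booleans are passed in lowercase (true/false).
    query = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in query.items()
    }
    url = f"{BASE_URL}/{object_type}/"
    return f"{url}?{urlencode(query)}" if query else url


print(build_query_url("harvest_jobs", page=2, per_page=50))
print(build_query_url("harvest_records", count=True))
```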

The following options provide more complex ways to filter or order the data:

  • order_by: the field(s) you want to be ordered by
  • facets: a custom "where"-style filter that is safe from SQL injection but allows various comparisons.

The facets filter can include multiple evaluations, such as finding error records that were created between a start and end time: https://harvest.data.gov/harvest_record_errors/?facets=date_created%20gt%202025-08-21T22:00:00,%20date_created%20lt%202025-08-21T23:59:59

You can also use ilike_op to search for certain strings: https://harvest.data.gov/harvest_record_errors/?facets=message%20ilike_op%20failed%25

You shouldn't need to quote anything in the facets field. The full list of operators can be seen here.
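A facets string can also be assembled programmatically. The sketch below joins `field operator value` conditions with commas and percent-encodes the result so that it reproduces the two example URLs above. The helper name `build_facets_url` is hypothetical, and only the operators shown on this page (`gt`, `lt`, `ilike_op`) are used.

```python
from urllib.parse import quote

# Host taken from the example URLs on this page.
BASE_URL = "https://harvest.data.gov"


def build_facets_url(object_type, conditions):
    """Build a facets query URL from (field, operator, value) tuples
    (hypothetical helper)."""
    # Conditions are comma-separated and values are left unquoted, as in the
    # examples above.
    facets = ", ".join(f"{field} {op} {value}" for field, op, value in conditions)
    # Keep ':' and ',' literal to match the example URLs; spaces become %20
    # and a literal '%' wildcard becomes %25.
    return f"{BASE_URL}/{object_type}/?facets={quote(facets, safe=':,')}"


# Error records created within a time window.
print(build_facets_url("harvest_record_errors", [
    ("date_created", "gt", "2025-08-21T22:00:00"),
    ("date_created", "lt", "2025-08-21T23:59:59"),
]))

# Error records whose message starts with "failed".
print(build_facets_url("harvest_record_errors", [("message", "ilike_op", "failed%")]))
```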

Please add more useful examples of queries of the H2.0 system here as they are discovered and/or used!
