How do you go about consuming paged APIs from your Rails app?

So I am writing an app that helps with reporting based on data from a third-party service.

My plan is to first, ingest the data from the API, and then to run reports on that data.

The API is paged, and sluggish, with the occasional timeout. How do I ingest large amounts of data from such a beast?

Ideas? Experience? Advice?

A lot depends on the API, of course, but there are other important questions as well:

  • does the API endpoint use throttling (limiting the number of requests per time slot)?
  • does the response include information about the next page?
  • what does it return if you request a non-existent page?
  • how big is the resulting data set - is it OK to keep it in memory, or should it be stored in, say, a DB?
  • how often do you need to import it? how many pages are there? will they be ingested in less than reimport_frequency? Depending on the answers, you might want to either process the data as it goes, page by page, or split the import into two phases: fetching and storing the raw responses to retrieve the full set as fast as possible, then processing them in batches, parallelizing the ingestion as much as the hardware allows.
  • do you need “resume” functionality for your importer?
  • do you have infrastructure for background processing, like Resque/Sidekiq, or will you be running it as a one-off script or a rake task?

All in all, you’ll probably benefit from using the “parallel” gem to retrieve the content in threads. If you know the number of pages in advance, you can just build an array of URLs and process that array in parallel.
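For the known-page-count case, a minimal sketch with the “parallel” gem might look like this (fetch_page and store_page are placeholders for your own HTTP call and persistence, and the URL is made up):

```ruby
require "parallel"

# total_pages would come from the API's first response or its docs.
urls = (1..total_pages).map { |n| "https://api.example.com/report?page=#{n}" }

Parallel.each(urls, in_threads: 10) do |url|
  response = fetch_page(url)   # placeholder: your HTTP client call
  store_page(response)         # placeholder: e.g. insert the rows into the DB
end
```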

If not, then you’ll probably want to use a Queue or a thread pool (like the “easy_threadpool” gem). For example, with a pool of 10 threads, you can start by enqueuing the first 10 pages. As soon as page X is fetched and processed, and if it has any content, you add page X+10 to the pool/queue. If the fetched page has no content, you simply don’t add anything.
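A bare-bones sketch of that idea using Ruby’s built-in Queue and plain threads, no extra gem (fetch_page and process are placeholders for the HTTP call and the storage step):

```ruby
POOL_SIZE = 10
queue = Queue.new
(1..POOL_SIZE).each { |page| queue << page }   # seed the first 10 pages

workers = Array.new(POOL_SIZE) do
  Thread.new do
    loop do
      page = begin
        queue.pop(true)        # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break                  # no work left for this thread
      end

      data = fetch_page(page)            # placeholder HTTP call
      next if data.nil? || data.empty?   # past the last page: enqueue nothing

      process(data)                      # placeholder: store to DB, etc.
      queue << page + POOL_SIZE          # page X done -> schedule page X+10
    end
  end
end

workers.each(&:join)
```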

For the occasional timeout, wrap the request in a rescue block and keep a counter of how many times you’ve rescued; wait a bit before retrying, and retry no more than, say, 10 times - maybe that page is just borked on their end. If it is, just move on to the next one.
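In code, that retry loop might look roughly like this - which exceptions to rescue depends on your HTTP client; Net::ReadTimeout and Errno::ECONNRESET are just examples, and http_get is a placeholder:

```ruby
require "net/http"

MAX_ATTEMPTS = 10

def fetch_with_retries(url)
  attempts = 0
  begin
    attempts += 1
    http_get(url)                        # placeholder for your HTTP client call
  rescue Net::ReadTimeout, Errno::ECONNRESET
    if attempts < MAX_ATTEMPTS
      sleep(attempts * 2)                # wait a bit longer each time
      retry
    end
    nil                                  # page looks borked on their end - skip it
  end
end
```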

You’ll probably end up with something like this: Fetching and processing undefined number of pages in a threaded manner · GitHub

@NeilDouglas’s questions are great. There’s a lot to consider for this kind of feature. I’ll try to give a general answer based on my experience.

I usually create a custom Enumerator for API results that automatically requests the next page when needed. If you’re ingesting data in the background and the API is slow or times out, it’s probably best to process as a job.
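Such an enumerator can be built with Enumerator.new and a yielder; here is a rough sketch (fetch_page and the response shape are assumptions):

```ruby
def paged_records(query)
  Enumerator.new do |yielder|
    page = 0
    loop do
      response = fetch_page(query, page: page)   # placeholder API client call
      break if response["records"].empty?

      response["records"].each { |record| yielder << record }
      page += 1
    end
  end
end

# Callers can then treat the whole paged result set as one collection:
paged_records(query).each { |record| store(record) }
```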

I like to use the JobIteration gem with the custom enumerator, but you could also make a job that reads and processes a single page, and then enqueues itself with the next page if there is a next page. Using jobs essentially gives you retries for free depending on your backend job processor. (I would use Sidekiq)
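The self-enqueuing variant might look roughly like this (ImportPageJob, ApiClient, Record and the response methods are all made-up names; retries come from ActiveJob/Sidekiq config):

```ruby
class ImportPageJob < ApplicationJob
  queue_as :api_import

  def perform(query_id, page)
    response = ApiClient.fetch(query_id, page: page)   # hypothetical API wrapper
    Record.insert_all(response.records)                # persist this page

    # Enqueue the next page only if the API says there is one.
    self.class.perform_later(query_id, page + 1) if response.next_page?
  end
end
```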

One thing to watch out for is offset-based pagination. If the pagination for this API is based on offsets, e.g. ?page=2, ?page=3, it’s possible that your job will end up skipping or duplicating items as other items are added or removed.

Ideally you want to use cursor-based pagination. Instead of saying “I’ve got the first 20 results, please give me the next 20”, it says something like “I’ve got up to ID 123, please give me the next 20”. With sequential IDs, ordered by ID, this would ensure you can’t miss or duplicate anything, even as items are added or removed.
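A cursor-based fetch loop then looks something like this (parameter names such as after_id are purely illustrative - they depend entirely on the API):

```ruby
cursor = nil
loop do
  batch = api_get("/items", after_id: cursor, limit: 20)   # hypothetical client call
  break if batch.empty?

  batch.each { |item| store(item) }
  cursor = batch.last["id"]   # "I've got up to ID 123, please give me the next 20"
end
```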


Neil thank you for your reply.

The API is limited to, I think, 25k requests per day and 5 concurrent requests.

The initial response includes information about the number of pages, along with the data of page zero. Subsequent requests contain the same query with a specific page number.

I assume a call to a non-existent page will return an error - I don’t see a scenario where I’d request a non-existent page.

The data set will be fairly large and stored in a database immediately.

I will be importing 100-20,000 pages of 100 records each, whenever a new report is needed - likely on a daily basis. It is possible that the reporting frequency may exceed the API limitations: a response usually takes about a second, and with the rate limit of 5 concurrent requests that’s a maximum of 18k responses per 24h at best.

The importer will need to resume failed pages.

Yes, I will likely be using Sidekiq.

I worked with this same API some time ago, and it was a PIA because of its slowness and its dropping connections on a regular basis. In the previous app there was no background job processing.

My plan so far, is:

Use a Sidekiq worker for each page. Have each worker update the database with the results, and update the query log on completion.

That way I can keep track of the query, the successfully delivered pages, and the remaining pages, and mark a query as completed once its last page has been delivered.
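For what it’s worth, that per-page worker could look roughly like this (ImportQuery, ApiClient, Record and the page-tracking columns are all hypothetical names for whatever you use to track progress):

```ruby
class PageImportWorker
  include Sidekiq::Job
  sidekiq_options queue: "api_import"

  def perform(query_id, page_number)
    query = ImportQuery.find(query_id)
    data  = ApiClient.fetch(query.params, page: page_number)   # hypothetical wrapper

    Record.insert_all(data)                                    # store the results
    query.pages.find_by!(number: page_number).update!(status: "done")

    # Mark the whole query complete once no pages remain unfinished.
    query.update!(completed_at: Time.current) if query.pages.where.not(status: "done").none?
  end
end
```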

Somehow limit Sidekiq job execution in this particular queue to 4 or 5 concurrent workers at a time. Open to suggestions on how to do this.
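If you’re on Sidekiq 7+, one option is a dedicated capsule with its own concurrency for just this queue (the capsule and queue names below are made up); on older versions, a gem like sidekiq-limit_fetch serves a similar purpose:

```ruby
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.capsule("api-import") do |cap|
    cap.concurrency = 5              # at most 5 threads working this queue
    cap.queues = %w[api_import]
  end
end
```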

And I will need to figure out how to queue the workers so as to remain within the API’s daily limit. Should I look into suspending the queue when the limit is reached, then un-suspending it the next day? That seems too complicated.

A simpler system could just queue the jobs and rely on Sidekiq to retry the failed ones - “failed” including those that failed because the daily limit was exceeded.
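One middle ground: rescue the rate-limit error inside the per-page worker and push the job back onto the schedule instead of letting it count as a failure - something like the following added to the worker sketched above (the error class, import_page helper and the 6-hour delay are assumptions):

```ruby
def perform(query_id, page_number)
  import_page(query_id, page_number)        # the actual fetch-and-store work
rescue ApiClient::DailyLimitExceeded        # hypothetical error from your API wrapper
  # Not really a failure - push it back onto the schedule for later instead.
  self.class.perform_in(6.hours, query_id, page_number)
end
```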

Thoughts?

Hi, thank you for this. I think in my case I’ll go with background jobs, hoping to get automated queue management. The API is slow but allows up to 5 concurrent requests, so I’ll make use of that.