Neil, thank you for your reply.
The API is limited to (I think) 25k requests per day and 5 concurrent requests.
The initial response includes the total number of pages, along with the data for page zero. Subsequent requests repeat the same query with a specific page number.
I assume a call to a non-existent page would return an error, but I don't see a scenario where we'd ever request a non-existent page.
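To make the flow concrete, here is roughly how I picture the paging. The endpoint, parameter names, and response keys are placeholders, not the real API, and in the actual importer each page would be its own Sidekiq job (see below) rather than a loop:

```ruby
require "net/http"
require "json"
require "uri"

# Placeholder endpoint, params, and response keys - just the shape of the flow
def fetch_page(query, page)
  uri = URI("https://api.example.com/report")
  uri.query = URI.encode_www_form(query.merge(page: page))
  JSON.parse(Net::HTTP.get(uri))
end

query = { report: "daily" }

first       = fetch_page(query, 0)   # page zero comes back with the metadata
total_pages = first["total_pages"]   # anywhere from 100 to 20,000 in my case
records     = first["records"]       # 100 records per page

(1...total_pages).each do |page|
  records.concat(fetch_page(query, page)["records"])  # same query, explicit page number
end
```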
The data set will be fairly large and will be stored in a database immediately.
I will be importing 100 to 20,000 pages of 100 records each whenever a new report is needed, likely on a daily basis. It is possible that the reporting frequency may exceed the API limitations: responses usually take about a second, and with the rate limit of 5 concurrent requests that's a maximum of about 18k responses per 24h at best.
The importer will need to resume failed pages.
Yes, I will likely be using Sidekiq.
I worked with this same API some time ago and it was a pain, because the API was slow and dropped connections on a regular basis. In the previous app there was no background job processing.
My plan so far is:
Use a Sidekiq worker for each of the pages. Have each worker update the database with the results and record the page's completion in the query log.
That way I can keep track of the query, the successfully delivered pages, and the remaining pages, and mark a query as completed once the last page has been delivered (a rough sketch of this worker is below).
Somehow limit Sidekiq job execution in this particular queue to 4 or 5 concurrent workers at a time. I'm open to suggestions on how to do this (one option is sketched below).
I will also need to figure out how to queue the workers so that I stay within the API's daily limit. Should I look into suspending the queue when the limit is reached and un-suspending it the next day? That seems too complicated.
A simpler system could just enqueue all the jobs and rely on Sidekiq to retry the failed ones, including those that fail because the daily limit was exceeded (also sketched below).
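For points 1 and 2, something like the following is what I have in mind. ImportQuery, ImportPage, and Record are hypothetical ActiveRecord models (one row for the report run, one row per expected page, and the imported data), and ApiClient.fetch_page stands in for the HTTP call sketched earlier:

```ruby
class ImportPageWorker
  include Sidekiq::Worker
  sidekiq_options queue: :api_import, retry: 10

  def perform(import_query_id, page_number)
    query = ImportQuery.find(import_query_id)
    data  = ApiClient.fetch_page(query.params, page_number)

    ActiveRecord::Base.transaction do
      # store the page's records and log the page as delivered
      Record.insert_all(data["records"])
      query.import_pages.find_by!(number: page_number).update!(delivered: true)

      # once the last page is in, flag the whole query as completed
      if query.import_pages.where(delivered: false).none?
        query.update!(completed_at: Time.current)
      end
    end
  end
end
```

Enqueuing would then just be one ImportPageWorker.perform_async(query.id, n) per page, once the page-zero response tells us how many pages there are.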
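For point 3, the two options I know of are running a dedicated Sidekiq process for this queue with its own concurrency setting (starting it with `-q api_import -c 5`), or capping the queue with the sidekiq-limit_fetch gem. I haven't settled on either; if I remember the gem's API correctly, that version would look roughly like:

```ruby
# config/initializers/sidekiq.rb
# With sidekiq-limit_fetch, cap the api_import queue so at most 5 of its jobs
# run at once, regardless of the overall Sidekiq concurrency for other queues.
Sidekiq::Queue["api_import"].limit = 5
```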
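And for points 4/5, instead of suspending the queue, my current thinking is to let the worker reschedule itself when the API signals that the quota is gone, and leave everything else to Sidekiq's normal retries. A sketch of the same worker as above, showing only the rescue path; DailyLimitExceeded is a hypothetical error the client would raise when it recognizes the quota response:

```ruby
class ImportPageWorker
  include Sidekiq::Worker

  # Hypothetical error raised by the API client when the daily quota is exhausted
  class DailyLimitExceeded < StandardError; end

  def perform(import_query_id, page_number)
    # ... fetch and store the page as in the sketch above ...
  rescue DailyLimitExceeded
    # don't burn Sidekiq retries on a quota problem; just push this page to tomorrow
    self.class.perform_in(24 * 60 * 60, import_query_id, page_number)
  end
end
```

The crude part is the fixed 24-hour delay; something smarter could compute the time until the quota actually resets, but that depends on whether the API exposes it.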
Thoughts?