Last week AWS announced a built-in RedShift Data API.
The Data API supports web service-based applications and uses IAM or database credentials stored in AWS secret manager. Credentials or passwords are not passed in the API calls. Additionally, a persistent connection is not needed to the RedShift cluster, instead relying on secure HTTP endpoints with asynchronous calls.
What it enables is a simplified secure way to access RedShift data using services such as Lambda and SageMaker. This adds versatility to RedShift and provides a new option when designing applications that leverage insights from your data warehouse.
A few things that stand out for me is the data retention and logging. The Data API will retain a query result for a maximum 24 hours, increasing efficiency for subsequent requests. Details on the configuration for the retention are not described. I do hope it becomes an option when making calls to allow applications to optimize it.
That being said, if you are concerned about query performance, you can use the Data API to query the metadata of your calls. When you execute a query the Data API will return an identifier for the result set.
You can use the identifier to fetch the results or run a describe statement to obtain useful information such as:
- result size
- the RedShift process ID
- the query ID.
In addition, the Data API is integrated with CloudTrail, which will capture all API calls. This includes the originating IP, and additional details needed to support it in a production environment.
Before you jump into using the Data API, make sure to look at its limitations. It does have a query limit of 100 MB and a row limit of 64 KB. If the result set is greater than either limit the call is ended. You can get around the 100 MB constraint by adding limits to your queries and using subsequent calls. That might add complexity in the calling of the APIs, and you need to evaluate your use case to not force a solution.