A bug that cost a couple of thousand USD: the AWS Rusoto AssumeRole throttling bug

I found a bug in Rusoto, the best (and maybe the only) AWS SDK for the Rust programming language. Tracking it down was painful and irritating, but the feeling I had when it was confirmed to be an actual bug was amazing!


What happened?

I was developing a Kinesis Consumer Client Library (KCL) in Rust, and suddenly, after a month, the AWS CloudTrail bill increased by a couple of thousand USD. What the fuck happened? It was AssumeRole calls from my KCL!!

The bug in a nutshell: the AssumeRole API was being called about 1 million times an hour instead of just once (as it should be).


Bug Effect

  1. It was a multi-account AWS setup.
  2. The KCL was in account A & the Kinesis stream itself was in account B.
  3. AssumeRole is used for cross-account authentication: to use a service in another account, you have to call AssumeRole first.
  4. An AssumeRole session lives for 1 hour by default & can be extended up to 12 hours by raising the role's maximum session duration.
  5. You have to use this session when calling any API in the other account & call AssumeRole again only when it has expired. This is where the bug in Rusoto was.
  6. An AssumeRole request was being made alongside every other request (~17,000 per minute).
  7. So it was ~1M requests per hour, when it was supposed to be only one.
  8. That caused throttling, because AWS rate-limits the AssumeRole API.
  9. The throttling resulted in way more log entries on CloudTrail, flagging that someone was abusing the AssumeRole API.
  10. More logs on CloudTrail caused the increase in the bill.
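
To make the numbers in the list above concrete, here is a tiny back-of-the-envelope sketch. The function names are hypothetical, made up for this post; the only inputs are the ~17,000 requests per minute and the 1-hour session mentioned above.

```rust
/// AssumeRole calls per hour when every API request triggers its own
/// AssumeRole call (the buggy behaviour described above).
fn assume_role_calls_per_hour(requests_per_minute: u64) -> u64 {
    requests_per_minute * 60
}

/// AssumeRole calls per hour when a session is reused until it expires.
/// `session_minutes` is the session lifetime in minutes (must be > 0).
fn expected_calls_per_hour(session_minutes: u64) -> u64 {
    // Ceiling division: how many sessions are needed to cover 60 minutes.
    (60 + session_minutes - 1) / session_minutes
}

fn main() {
    let actual = assume_role_calls_per_hour(17_000);
    let expected = expected_calls_per_hour(60);
    println!("actual: {} calls/hour, expected: {} call/hour", actual, expected);
}
```

With ~17,000 requests per minute this gives 1,020,000 AssumeRole calls per hour, against a single call when the 1-hour session is reused.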

Bug Details

  1. The session_duration parameter is not used for caching.
  2. It causes:
    1. A huge performance issue, because every call becomes 2 requests instead of 1 (your API request + an AssumeRole request).
    2. Throttling of the AssumeRole API under high load, which costs more money if CloudTrail is enabled.
  3. The session is valid for one hour, so it should be reused until it expires.

Example: every call to the Kinesis stream get records API triggers an AssumeRole call, instead of reusing the cached session.

The solution is really simple, a single line of code: wrap the StsAssumeRoleSessionCredentialsProvider in rusoto_credential::AutoRefreshingProvider.
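
To show why the wrapper makes such a difference, here is a minimal, std-only sketch of the caching idea. It is not Rusoto's code and doesn't use any Rusoto types: `Credentials`, `ProvideCredentials`, `FakeStsProvider` and `CachingProvider` are hypothetical names I made up for illustration, and the fake provider only counts calls instead of talking to AWS.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the temporary credentials AssumeRole returns.
#[derive(Clone)]
struct Credentials {
    token: String,
    expires_at: Instant,
}

/// Hypothetical provider trait (not Rusoto's real trait).
trait ProvideCredentials {
    fn credentials(&mut self) -> Credentials;
}

/// Fake STS-backed provider: every call counts as one AssumeRole request.
struct FakeStsProvider {
    assume_role_calls: u64,
    session_duration: Duration,
}

impl ProvideCredentials for FakeStsProvider {
    fn credentials(&mut self) -> Credentials {
        self.assume_role_calls += 1;
        Credentials {
            token: format!("session-{}", self.assume_role_calls),
            expires_at: Instant::now() + self.session_duration,
        }
    }
}

/// Wrapper that reuses the inner provider's credentials until they expire.
struct CachingProvider<P: ProvideCredentials> {
    inner: P,
    cached: Option<Credentials>,
}

impl<P: ProvideCredentials> ProvideCredentials for CachingProvider<P> {
    fn credentials(&mut self) -> Credentials {
        // Serve from the cache while the session is still valid.
        if let Some(c) = &self.cached {
            if c.expires_at > Instant::now() {
                return c.clone();
            }
        }
        // Cache is empty or expired: ask the inner provider exactly once.
        let fresh = self.inner.credentials();
        self.cached = Some(fresh.clone());
        fresh
    }
}

fn fake_sts() -> FakeStsProvider {
    FakeStsProvider { assume_role_calls: 0, session_duration: Duration::from_secs(3600) }
}

/// AssumeRole calls made for `requests` API calls without any caching.
fn assume_role_calls_uncached(requests: u64) -> u64 {
    let mut provider = fake_sts();
    for _ in 0..requests {
        let _ = provider.credentials();
    }
    provider.assume_role_calls
}

/// AssumeRole calls made for `requests` API calls with the caching wrapper.
fn assume_role_calls_cached(requests: u64) -> u64 {
    let mut provider = CachingProvider { inner: fake_sts(), cached: None };
    for _ in 0..requests {
        let _ = provider.credentials();
    }
    provider.inner.assume_role_calls
}

fn main() {
    println!("uncached: {}", assume_role_calls_uncached(17_000));
    println!("cached:   {}", assume_role_calls_cached(17_000));
    let mut provider = CachingProvider { inner: fake_sts(), cached: None };
    println!("token in use: {}", provider.credentials().token);
}
```

In this toy model, 17,000 requests cost 17,000 AssumeRole calls without the wrapper and only one with it. In real code you wouldn't write the wrapper yourself: as far as I understand it, AutoRefreshingProvider plays the role of `CachingProvider` here.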


Post Mortem

I struggled a lot to convince myself that the bug was in Rusoto. It's the best SDK out there; they can't have such a bug, and if they did, I wouldn't be the first to find it. No way!

Also, because I'm a newbie in Rust, the code was very hard to debug, as it's really complicated.

This was a mistake. I would have saved a lot of time and effort if this hadn't been my mindset.

Have faith in yourself!