Scraping public Amazon S3 buckets with Python

Security credentials, personal data, user data, and business information are often publicly exposed in leaky buckets.

Not all objects in a bucket are necessarily public: each object can have its own access control rules, and hundreds or thousands of them may be private. Checking each one by hand would be extremely tedious, if not impossible, for a single person. That is why I created this tool to do it for you (well, for me, but I am sharing it here).
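
For context, "public" here means an object can be fetched with no AWS credentials at all. A minimal sketch of the per-key check the tool automates (the bucket and key names are hypothetical, and the object_is_public helper is mine, not the script's):

import requests

def object_is_public(bucket, key):
    # An anonymous HEAD request succeeds (200) only if the object's ACL
    # allows public reads; a private object returns 403 Forbidden.
    url = "https://%s.s3.amazonaws.com/%s" % (bucket, key)
    return requests.head(url).status_code == 200

print(object_is_public("example-bucket", "backups/dump.sql"))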

GitHub repo

The great thing about this is that you do not need any credentials or signups to use it. It can be used completely anonymously if you wish.
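
Under the hood, anonymous access just means sending unsigned requests. With boto3, for example, you can build a client that skips request signing entirely (a general technique, not necessarily how the script itself does it):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# signature_version=UNSIGNED tells botocore not to sign requests,
# so no access key or secret is ever needed or sent.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))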

Just enter the name of the Amazon S3 bucket (no Google Cloud Platform support at this time), and it will spit out a .csv file containing every key name, whether the object is public, its last modified date, and its size in bytes.
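
The core of that workflow is a paginated anonymous listing written out as CSV. A minimal Python 3 sketch using boto3 (the dump_keys helper and the column names are mine, not the script's):

import csv
import boto3
from botocore import UNSIGNED
from botocore.config import Config

def dump_keys(bucket, out_path):
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    # list_objects_v2 returns at most 1000 keys per call; the paginator
    # follows continuation tokens until the whole bucket is listed.
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["key", "last_modified", "size_bytes"])
        for page in pages:
            for obj in page.get("Contents", []):
                writer.writerow([obj["Key"], obj["LastModified"].isoformat(), obj["Size"]])

dump_keys("example-bucket", "example-bucket.csv")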

Example of the script running with the verbose option

Usage

It was designed to be used from the command line; here are the options:

python2 s3getkeys.py -t <bucket> [--key=<key>] [-r] [-v] [--acl] [-o=<file>]
python2 s3getkeys.py -t <bucket> [--key] [--estimate]
python2 s3getkeys.py -t <bucket> [-h|--help]

Options:
-t, --bucket <bucket>  bucket to fetch keys from
--key <key>            key to start from
-r                     recursively fetch all keys
-v                     verbose, print keys
--acl                  check if each key is public; can take a long time in large buckets
-o, --output <file>    name of output file, do not include .csv [default: bucket]
--estimate             estimate how long a run with [-r] [--acl] would take
-h, --help             show this help info
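
For example, a full recursive scan of a hypothetical bucket, checking every key's ACL and writing the results to scan_results.csv, might look like:

python2 s3getkeys.py -t example-bucket -r --acl -o scan_results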

Caveats

For very large buckets (e.g. 500,000+ keys), the script in its current form can take a very long time to run with the --acl flag. This is because this preliminary version does not use multithreading or async I/O.
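
To illustrate why that matters: each ACL check is an independent HTTP round trip, so they parallelize naturally. This is not in the tool yet, just a Python 3 sketch of the kind of approach that could speed it up (the is_public helper and worker count are mine):

import requests
from concurrent.futures import ThreadPoolExecutor

def is_public(args):
    bucket, key = args
    # Anonymous HEAD: 200 means the object is publicly readable
    url = "https://%s.s3.amazonaws.com/%s" % (bucket, key)
    return key, requests.head(url).status_code == 200

def check_keys(bucket, keys, workers=32):
    # Running the independent HEAD requests on a thread pool divides
    # the wall-clock time by roughly the worker count (network permitting).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(is_public, [(bucket, k) for k in keys]))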

I will be updating it soon to add prefix/delimiter options and some other filtering, as well as an option to discard private keys instead of writing them to the .csv file (I would recommend not always doing this, as the key names themselves can reveal a lot of data as well, such as customer names, business proposals/plans, non-public project names, etc.).
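
For reference, the S3 API already supports that kind of filtering server-side via the Prefix and Delimiter parameters of list_objects_v2. A sketch of what such an option would wrap (the bucket name and prefix are hypothetical):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
# Prefix restricts the listing to keys under "backups/"; Delimiter="/"
# groups deeper keys into CommonPrefixes, like folders in a file browser.
for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket="example-bucket", Prefix="backups/", Delimiter="/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])
    for sub in page.get("CommonPrefixes", []):
        print(sub["Prefix"])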

I will also be looking into ways to support GCP buckets, as well as making the script faster.