
Better Amazon S3 Storage Method


KT Walrus


This week, I wrote a plugin for the FileSystem Storage Method that uses an Amazon S3 bucket as a "backing store" and treats the local filesystem storage as a file cache. That is, the FileSystem Storage Method saves new files to both the local filesystem and to S3. When the contents are needed, the plugin tries to retrieve the local filesystem copy. Failing that, it tries to retrieve from S3 and saves the content to the local filesystem (for the next time the file is needed). I intend to write a system task to limit the size of the local filesystem and purge the oldest files when the limit is reached.
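A minimal sketch of the save/read flow described above, not the plugin's actual code. The class name `CachedStore` and both directory arguments are hypothetical, and it assumes PHP 8.0 or later. A second local directory stands in for the S3 bucket here; with the S3 stream wrapper registered, the same calls should accept an `s3://<bucket>` prefix instead.

```php
<?php
// Hypothetical read-through / write-through cache (illustration only).
final class CachedStore
{
    public function __construct(
        private string $cacheDir,   // local filesystem cache
        private string $backingDir  // stand-in for the S3 bucket
    ) {}

    // Save new content to both the local cache and the backing store.
    public function save(string $name, string $contents): void
    {
        file_put_contents("{$this->cacheDir}/{$name}", $contents);
        file_put_contents("{$this->backingDir}/{$name}", $contents);
    }

    // Return the content, preferring the local copy and refilling the cache on a miss.
    public function contents(string $name): string
    {
        $local = "{$this->cacheDir}/{$name}";
        if (is_file($local)) {
            return file_get_contents($local);
        }
        $remote = "{$this->backingDir}/{$name}";
        if (!file_exists($remote)) {
            throw new RuntimeException("Not found in cache or backing store: {$name}");
        }
        $data = file_get_contents($remote);
        file_put_contents($local, $data); // keep a local copy for next time
        return $data;
    }
}

// Demo with two temporary directories.
$cache   = sys_get_temp_dir() . '/cache_' . uniqid();
$backing = sys_get_temp_dir() . '/backing_' . uniqid();
mkdir($cache);
mkdir($backing);

$store = new CachedStore($cache, $backing);
$store->save('avatar.txt', 'hello');
unlink("{$cache}/avatar.txt");               // simulate a cache purge
echo $store->contents('avatar.txt'), "\n";   // restored from the backing store
```

After the last call the file is back in the cache directory, so the next read never touches the backing store.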

I used the Amazon PHP SDK (https://github.com/aws/aws-sdk-php) and the tremendous Stream Wrappers feature (http://docs.aws.amazon.com/aws-sdk-php/v2/guide/feature-s3-stream-wrapper.html). 

Quote

The Amazon S3 stream wrapper allows you to store and retrieve data from Amazon S3 using built-in PHP functions like file_get_contents, fopen, copy, rename, unlink, mkdir, rmdir, etc.

You need to register the Amazon S3 stream wrapper in order to use it:


// Register the stream wrapper from an S3Client object
$client->registerStreamWrapper();

This allows you to access buckets and objects stored in Amazon S3 using the s3:// protocol. The "s3" stream wrapper accepts strings that contain a bucket name followed by a forward slash and an optional object key or prefix: s3://<bucket>[/<key-or-prefix>].

 
So, implementing this S3 backing store was really quite trivial to add to the existing FileSystem Storage Method.

I believe this is a much better Storage Method than either the current FileSystem or Amazon Storage Methods since:

  1. All files are "backed up" to reliable S3 bucket storage.
  2. The local filesystem has a quota. When local storage on your server runs out, the oldest files can be purged from the local filesystem; if a purged file is needed later, it is restored from S3 back to the local filesystem.
  3. Amazon doesn't charge for upload bandwidth, but has pretty high charges for download bandwidth. Because this plugin uses the local filesystem as a cache, your Amazon bandwidth bill should be greatly reduced.
  4. The Amazon PHP SDK allows for making requests asynchronously so multiple uploads (or upload chunks) or downloads can be done simultaneously without eating up too much PHP memory. So, the interactions with S3 can be performed much more quickly. I didn't incorporate this yet in my plugin, but IPS4's implementation could take advantage of async calls to S3, if maximum performance is desired.
  5. The uploading of a new file to the site (from the user's perspective) could be made to be much quicker (especially for large files). I intend to modify my plugin to simply store the file locally (which is very fast) and store a record in the local database that the file needs to be uploaded to S3 from the local file on disk as soon as possible. A system task would then check for any files that need to be uploaded to S3 and perform that task (in the cron job). 
  6. From the user's point of view, all files are uploaded/downloaded from the site's domain (giving the possibility of minimizing the number of HTTP requests made to load a page with lots of attachments/profile photos by using HTTP/2). This could make our sites even faster to load.
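The quota purge in point 2 could be sketched roughly as follows. The function name `purgeOldest` is hypothetical, and "oldest" is assumed to mean oldest modification time; a real cache might prefer last access time. It assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper: delete the oldest files in $dir until the total size
// is at or below $quotaBytes. Returns the names of the files it removed.
function purgeOldest(string $dir, int $quotaBytes): array
{
    $files = [];
    $total = 0;
    foreach (new FilesystemIterator($dir, FilesystemIterator::SKIP_DOTS) as $info) {
        if ($info->isFile()) {
            $files[] = [
                'path'  => $info->getPathname(),
                'mtime' => $info->getMTime(),
                'size'  => $info->getSize(),
            ];
            $total += $info->getSize();
        }
    }

    // Oldest modification time first.
    usort($files, fn(array $a, array $b): int => $a['mtime'] <=> $b['mtime']);

    $removed = [];
    foreach ($files as $file) {
        if ($total <= $quotaBytes) {
            break;
        }
        if (unlink($file['path'])) {
            $total -= $file['size'];
            $removed[] = basename($file['path']);
        }
    }
    return $removed;
}

// Demo: three 10-byte files with increasing modification times, quota of 15 bytes.
$dir = sys_get_temp_dir() . '/purge_demo_' . uniqid();
mkdir($dir);
foreach (['a.bin' => 1000, 'b.bin' => 2000, 'c.bin' => 3000] as $name => $mtime) {
    file_put_contents("{$dir}/{$name}", str_repeat('x', 10));
    touch("{$dir}/{$name}", $mtime);
}
$removed = purgeOldest($dir, 15);
print_r($removed); // the two oldest files are removed, the newest one stays
```

Because every purged file still exists in the backing store, the purge only costs a later cache miss, not data.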

Anyway, I'm sure IPS4 Devs could implement an even better Storage Method using the Amazon SDK and the existing FileSystem Storage Method. I hope you would consider this for 4.2.0 (or 4.2.1) so I don't have the burden of maintaining my plugin just for my site.

BTW, I plan on adding OpenStack as an option to my backing store plugin. I'm using the PHP OpenStack SDK (https://github.com/php-opencloud/openstack) to access OVH Cloud Object Storage. OVH uses OpenStack with Identity v2 authentication (which took me a while to figure out), but it is now working for me. OVH has much lower prices than Amazon for Object Storage so I can lower my storage costs even further. See OVH Object Storage pricing on this page: https://www.ovh.com/us/public-cloud/storage/object-storage/, if you are interested in adding support to IPS4 for this option.


One more thing... Since it took me a while to figure out how to authenticate with OVH's OpenStack service, I went ahead and asked the PHP OpenStack SDK maintainers to document how to do this. OVH runs the deprecated Identity v2 while Rackspace runs the current Identity v3 (Rackspace is the creator of the PHP OpenStack SDK, as far as I know).

If you do decide to add support for OVH Object Storage, you should consult my issue here:

https://github.com/php-opencloud/openstack/issues/127

If you decide to also support Rackspace Object Storage, you can have a configuration option to specify whether OpenStack is running with Identity v2 or Identity v3. This way, the IPS4 Storage Method could support any OpenStack implementation. As far as I know, there is only one version of the OpenStack Object Store and it is v1, so you would have universal support for OpenStack (either Public Cloud or on-premises Private Cloud).


I just found an even better SDK to base the Ultimate IPS4 Storage Method on:

http://flysystem.thephpleague.com

This allows you to "mount" multiple filesystems (like Local, Redis, Memcache, AWS S3, Rackspace, SFTP, Dropbox, GridFS, etc.) and use a single API to manage the files in the mounted filesystems.

This should allow a very flexible IPS4 Storage Method where the admin can choose which filesystems (local, network, or cloud) to save files to and whether the chosen filesystem is to be used as a cache or a persistent datastore. In my case, Amazon S3 will be the persistent datastore, but the local filesystem (which actually is a shared filesystem mounted to all my PHP/Nginx servers) will be a cache (with initial files stored in the cache and uploaded to Amazon S3 by a system task). I think I will implement this Storage Method with a database table to track which filesystems contain the file. This way, the file will be marked initially as being in the local filesystem (using the "local:/container/filename" path). Later, the table row will be updated to reflect that the file also exists in S3 (using the "s3:/bucket/container/filename").

With this API, my Storage Method will be able to store the file in multiple caches and in multiple backing stores as selected by the admin. The backing stores will be updated by system task and the local caches updated when files are initially saved or when they are fetched from a backing store.

I will probably have to generate an md5_hash and store it in the database table so the system task that uploads/moves files into the persistent datastore can check the integrity of newly uploaded files (by downloading the file on the next run of the system task and checking that it has the same md5_hash). The backing stores of most cloud providers claim very high durability (Amazon S3, for example, is designed for 99.999999999% durability), so once the integrity of the file has been confirmed, it should remain that way. When the Storage Method downloads a file from a persistent datastore, it can confirm the md5_hash again (just in case of a networking error).
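The integrity check could look something like this. The helper name `verifyMd5` is hypothetical, and a plain `copy()` stands in for the download from the persistent datastore.

```php
<?php
// Hypothetical helper: confirm that a file matches the MD5 recorded at save time.
function verifyMd5(string $path, string $expectedMd5): bool
{
    $actual = hash_file('md5', $path);
    return $actual !== false && hash_equals(strtolower($expectedMd5), $actual);
}

// At save time: compute the hash that would go into the tracking table.
$original = tempnam(sys_get_temp_dir(), 'orig');
file_put_contents($original, 'profile photo bytes');
$recorded = md5_file($original);

// On the next task run: fetch the stored copy and compare.
$downloaded = tempnam(sys_get_temp_dir(), 'copy');
copy($original, $downloaded);               // stand-in for a real download
var_dump(verifyMd5($downloaded, $recorded)); // intact copy

// A damaged copy no longer matches the recorded hash.
file_put_contents($downloaded, 'corrupted');
var_dump(verifyMd5($downloaded, $recorded));
```

MD5 is adequate here for catching transfer errors; it is not a defence against deliberate tampering.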

I will also probably encrypt the file on the local filesystem before copying it to the persistent store. I may even keep the files encrypted in the local cache. That way, I am sure that even if my cloud provider has a security breach, my users' data will be protected.
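One way to do the encryption step, as a sketch only: the post does not say which cipher or library would be used, so AES-256-GCM through PHP's OpenSSL extension is an assumption, as are the function names `encryptBlob` and `decryptBlob`. It works on whole strings, so very large files would need a streaming approach instead.

```php
<?php
// Hypothetical helpers: authenticated encryption with AES-256-GCM (requires ext-openssl).
// Blob layout: 12-byte IV, 16-byte authentication tag, then the ciphertext.
function encryptBlob(string $plain, string $key): string
{
    $iv  = random_bytes(12);
    $tag = '';
    $cipher = openssl_encrypt($plain, 'aes-256-gcm', $key, OPENSSL_RAW_DATA, $iv, $tag);
    if ($cipher === false) {
        throw new RuntimeException('Encryption failed');
    }
    return $iv . $tag . $cipher;
}

function decryptBlob(string $blob, string $key): string
{
    $iv     = substr($blob, 0, 12);
    $tag    = substr($blob, 12, 16);
    $cipher = substr($blob, 28);
    $plain  = openssl_decrypt($cipher, 'aes-256-gcm', $key, OPENSSL_RAW_DATA, $iv, $tag);
    if ($plain === false) {
        throw new RuntimeException('Decryption failed (wrong key or modified data)');
    }
    return $plain;
}

// Demo: the key stays on the server; only the encrypted blob goes to the cloud.
$key  = random_bytes(32);
$blob = encryptBlob('attachment contents', $key);
echo decryptBlob($blob, $key), "\n";
```

Because GCM authenticates the data, a blob that was altered in storage fails to decrypt rather than returning garbage.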


On 01.07.2017 at 7:46 PM, KT Walrus said:

This week, I wrote a plugin for the FileSystem Storage Method that uses an Amazon S3 bucket as a "backing store" and treats the local filesystem storage as a file cache.

Is it going to be available in the marketplace?

I think the idea is interesting: it will save a lot of money for the reasons above, and it adds an extra level of security.


On 7/9/2017 at 4:44 PM, Cyboman said:

Is it going to be available in the marketplace?

 

Sorry. I don't have the skills to make this generally available. I was hoping that IPS4.2 (or 4.2.1) would implement this for all.

I did end up implementing the following:

  • Each PHP server has a cache directory for caching files on the local hard disk (cache is cleared nightly so it never grows too large)
  • Saving a file first stores it in the local cache directory and computes its md5 and filesize.
  • A row is created in the database to track the file in permanent storage (saving the md5 and filesize) so it may be later uploaded to the Cloud.
  • The file in the cache is streamed to GridFS (MongoDB permanent storage). This is very quick. File is replicated to 2 MongoDB servers locally.
  • A system task streams the new GridFS files to an OVH OpenStack container and to an Amazon S3 bucket (immediately saving the file in Glacier for offsite backup). The MD5 is verified for uploads to Amazon; I'm still looking into how to do this for OVH.
  • When the contents() of a file are needed, the local cache is checked. On cache miss, the file is restored to the cache from GridFS or OVH (preferring GridFS). The file restored is checked for proper md5 before saving to the cache.
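The cache-miss step in the last bullet can be sketched as an ordered fallback. The function name `restoreFromStores` is hypothetical, and simple closures stand in for the real GridFS and OVH clients, which are not shown in the post. It assumes PHP 7.4 or later.

```php
<?php
// Hypothetical sketch: try each store in order and accept the first copy
// whose MD5 matches the value recorded in the database.
function restoreFromStores(string $name, string $expectedMd5, array $stores): ?array
{
    foreach ($stores as $label => $fetch) {
        $data = $fetch($name); // each fetcher returns the bytes, or null if unavailable
        if ($data !== null && hash_equals($expectedMd5, md5($data))) {
            return ['source' => $label, 'data' => $data];
        }
    }
    return null; // no store had a valid copy
}

// Demo: the preferred store returns a damaged copy, so the next one is used.
$good = 'attachment bytes';
$md5  = md5($good);
$stores = [
    'gridfs' => fn(string $n): ?string => 'truncated', // simulated bad copy
    'ovh'    => fn(string $n): ?string => $good,
];
$result = restoreFromStores('photo.jpg', $md5, $stores);
echo $result['source'], "\n";
```

The array order encodes the preference, so preferring GridFS over OVH is just a matter of listing it first.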

The monthly cost of this Storage Method is:

  • Next to nothing for PHP server cache
  • Less than $0.02/GB for GridFS storage (next to nothing until my servers' existing HDDs are filled and I have to purchase file servers for GridFS storage)
  • $0.0112/GB for OVH Object Storage
  • $0.004/GB for Amazon Glacier Storage
  • $0.011/GB for retrieval from OVH Object Storage (only done if both instances of MongoDB go offline when needing to refresh the local cache).

So, under 4 cents per GB per month (and much cheaper to start) with hardly any bandwidth charges except for restoring files from Cloud storage in the unlikely case that GridFS loses both copies of the data or goes down temporarily.

Cost to restore from OVH Object Storage is $0.011/GB. Cost to restore from an Amazon S3 bucket is around $0.10/GB (since I host at OVH and bandwidth from S3 to OVH is expensive). I'm only storing in Glacier just in case OVH unexpectedly shuts down or loses all my files.
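The arithmetic behind these figures, worked through in a short script. The per-GB prices are the ones quoted above (using $0.02 as the upper bound for GridFS); the 500 GB library size is a made-up example, not a number from the post.

```php
<?php
// Per-GB monthly storage prices quoted in the post (USD).
$perGbMonth = [
    'GridFS (upper bound)' => 0.02,
    'OVH Object Storage'   => 0.0112,
    'Amazon Glacier'       => 0.004,
];
$total = array_sum($perGbMonth);
printf("Storage: \$%.4f per GB per month\n", $total);

// Hypothetical library size, for illustration only.
$gb = 500;
$monthly      = $gb * $total;  // recurring storage cost
$restoreOvh   = $gb * 0.011;   // one-off full restore from OVH
$restoreAmazon = $gb * 0.10;   // one-off full restore from Amazon (approximate rate)

printf("Monthly storage for %d GB: \$%.2f\n", $gb, $monthly);
printf("Full restore from OVH:     \$%.2f\n", $restoreOvh);
printf("Full restore from Amazon:  \$%.2f\n", $restoreAmazon);
```

The three storage prices sum to $0.0352 per GB per month, which is where the "under 4 cents" figure comes from, and a full restore from Amazon costs roughly nine times as much as one from OVH.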


I've enhanced my Storage Method so that it uses the local filesystem as a cache and saves all files in an S3-compatible object storage server called Minio. Minio is easy to deploy and uses local HDDs for very cheap object storage (that is highly available and durable, just like AWS S3). The great thing about Minio is that you can use the AWS PHP SDK to manage the objects in Minio. This means that I can use the S3 Stream Wrappers to adapt the FileSystem.php storage method to stream files to/from Minio.

In addition to storing all files in Minio and using Minio to refresh the local filesystem on cache-miss, my Storage Method inserts a row in a MySQL table for each file saved in Minio with the added time() and a null archived time. A simple system task runs once a day to archive all newly stored files in an Amazon Glacier bucket (via S3 bucket). This integrates doing backups for my new Storage Method.
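The daily archive task could be sketched like this. An in-memory array stands in for the MySQL table, and the column names, the function name `archivePending` and the `$archive` callback are all hypothetical; the real task would select rows where the archived time is null and copy each file to the Glacier-backed bucket.

```php
<?php
// Hypothetical stand-in for the tracking table: one row per file saved in Minio.
$rows = [
    ['path' => 'attachments/a.jpg', 'added' => 1500000000, 'archived' => 1500086400],
    ['path' => 'attachments/b.jpg', 'added' => 1500090000, 'archived' => null],
    ['path' => 'profile/c.png',     'added' => 1500095000, 'archived' => null],
];

// Archive every row that has no archived time yet; stamp it only on success.
function archivePending(array &$rows, callable $archive, int $now): int
{
    $count = 0;
    foreach ($rows as &$row) {
        if ($row['archived'] === null && $archive($row['path'])) {
            $row['archived'] = $now;
            $count++;
        }
    }
    unset($row); // break the reference left by the loop
    return $count;
}

// Demo: the callback records which paths it was asked to archive.
$sent = [];
$archive = function (string $path) use (&$sent): bool {
    $sent[] = $path;   // a real task would upload the file here
    return true;
};
$archivedCount = archivePending($rows, $archive, 1500100000);
echo $archivedCount, " file(s) archived\n";
```

Rows whose upload fails keep a null archived time, so the next daily run picks them up again without extra bookkeeping.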

This is a much simpler Storage Method and runs very fast and uses HDDs attached to my servers. AWS Glacier storage is very cheap ($0.004/GB) and is only going to be used for disaster recovery. Minio has a nice command line client called "mc" and has a client that mounts the Minio bucket as a local filesystem so I can easily administer the Objects stored in Minio. 

I'm using my new Storage Method to store all user uploaded files (mainly profile photos and attachments, for my site). I'm also thinking of mounting a Minio bucket to the local filesystem for storage of server generated files (like JS, CSS, and HTML templates) and even the IPS4 source files so the same set of files are available to all NGINX/PHP servers while being highly-available and durable (because they are stored in a distributed Minio bucket). 

This is turning out to be a better storage method:

 

 


Archived

This topic is now archived and is closed to further replies.
