Min requirements of AWS S3 One Zone IA (https://aws.amazon.com/s3/storage-classes/?nc=sn&loc=3)
How would you store >10PB if you were in my shoes? The thought experiment can be with and without the data transfer cost out of our current S3 buckets. Please also mention what your experience is based on. Ideally you store large amounts of data yourself and can speak from first-hand experience.
Thank you for your support!! I will post a follow-up thread once we've reached a decision and know what we ended up doing.
Update: I should have mentioned earlier that the data needs to be accessible at all times. It's user-generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms is required.
The data is all images and videos, and no queries need to be performed on the data.
HPE sells their Apollo 4000[^1] line, which takes 60x 3.5" drives; with 16TB drives that's 960TB per machine, so one rack of 10 of these is 9PB+, which nearly covers your 10PB need. (We have some racks like this.) They are not cheap. (Note: Quanta makes servers that can take 108x 3.5" drives, but they need special deep racks.)
The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph, and ZFS across multiple machines is nasty as far as I'm aware, but I could be wrong. HDFS would work, but the latency can be completely random there.
[^1]: https://www.hpe.com/uk/en/storage/apollo-4000.html
So unless you are desperate to save money in the long run, stick to the cloud, and let someone else sweat about the filesystem level issues :)
EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.
EDIT2: at 10PB HDFS would be happy; buy 3 racks of those Apollos and you're done. We only started struggling at 1000+ nodes; now, with 2400 nodes, nearly 250PB raw capacity, and literally a billion filesystem objects, we are slow as f*, so plan carefully.
10PB costs more than $210,000 per month at S3, or more than $12M after five years.
RackMountPro offers a 4U server with 102 bays, similar to the Backblaze servers, which fully configured with 12TB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)
That means that you could fit all 15PB (for erasure coding with Minio) in less than two racks for around $150k up-front.
Figure another $5k/mo for monthly opex as well (power, bandwidth, etc.)
Instead of $12M spent after five years, you'd be at less than $500k, including traffic (also far cheaper than AWS.) Even if you got AWS to cut their price in half (good luck with that), you'd still be saving more than $5 million.
Getting the data out of AWS won't be cheap, but check out the snowball options for that: https://aws.amazon.com/snowball/pricing/
* To store 10+ PB of data.
* You need 15 PB of storage (running at 66% capacity)
* You need 30 PB of raw disks (twice for redundancy).
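To make the bullets above concrete, here is a rough sketch of the disk math. The 66% fill, 2x redundancy, 16TB drive size, and drive price are just the assumptions stated or implied above; swap in your own.

```python
# Rough capacity math for 10 PB usable; all inputs are assumptions to tweak.
usable_target_pb = 10      # data you actually need to keep
fill_factor = 0.66         # don't run the cluster full
redundancy = 2             # two copies of everything
drive_tb = 16              # marketed capacity per drive
drive_price_usd = 350      # rough street price, purely illustrative

provisioned_pb = usable_target_pb / fill_factor      # ~15 PB provisioned
raw_pb = provisioned_pb * redundancy                 # ~30 PB of raw disk
drives = raw_pb * 1000 / drive_tb                    # ~1900 drives
print(f"~{raw_pb:.0f} PB raw, ~{drives:.0f} drives, "
      f"~${drives * drive_price_usd / 1e6:.1f}M in disks alone")
```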
You're looking at buying thousands of large disks, on the order of a million dollars upfront. Do you have that sort of money available right now?
Maybe you do. Then, are you ready to receive and handle entire pallets of hardware? That will need to go somewhere with power and networking. They won't show up for another 3-6 months because that's the lead time to receive an order like that.
If you talk to Dell/HP/other, they can advise you and sell you large storage appliances. Problem is, the larger appliances will only host 1 or 2 PB. That's nowhere near enough.
There is a sweet spot in moving off the cloud, if you can fit your entire infrastructure into one rack. You're not in that sweet spot.
You're going to be filling multiple racks, which is a pretty serious issue in terms of logistics (space, power, upfront costs, networking).
Then you're going to have to handle "sharding" on top of the storage because there's no filesystem that can easily address 4 racks of disks. (Ceph/Lustre is another year long project for half a person).
The conclusion of this story: S3 is pretty good. Your time would be better spent optimizing the software. What is expensive? The storage, the bandwidth, or both?
* If it's the bandwidth: you need to improve your CDN and caching layer.
* If it's the storage: work on better compression for the images and videos, and check whether you can adjust retention.
But you'll need to balance the cost of finding people with that level of knowledge and adaptability against the cost of bundled storage packages. We were running super lean, got great deals on bandwidth and power, and had low performance requirements. When we ran the numbers for all-in costs, it was less than we thought we could get from any other vendor. And if you commit to buying the server racks it will take to fit 10PB, you can probably get somebody like Quanta to talk to you.
1) Staff You'll need at least one person, maybe two, to build, operate, and maintain any self-hosted solution. A quick peek at Glassdoor and Salary.com shows the unloaded salary for a Storage Engineer runs $92,000-130,000 US. Multiply by 1.25-1.4 for the loaded cost of an employee (things like FICA, insurance, laptop, facilities, etc). Storage Administrators run lower, but still around $70K US unloaded. Point is, you'll be paying around $100K+/year per storage staff position.
2) Facilities (HVAC, electrical, floor loading, etc) If you host on-site (not hosting facility), you'd better make certain your physical facilities can handle it. Can your HVAC handle the cooling, or will you need to upgrade it? What about your electrical? Can you get the increased electrical in your area? How much will your UPS and generator cost? Can the physical structure of the building (floor loading, etc) handle the weight of racks and hundreds of drives, the vibration of mechanical drives, the air cycling?
3) Disaster Recovery/Business Continuity Since you're using S3 One Zone IA, you have no multi-zone redundancy. Its use case is secondary backup storage, not the primary data store for running a startup. When there is an outage/failure (and it will happen), the startup may be toast, and investors none too happy. So this is another expense you're going to have to seriously consider, whether you stick with S3 or roll your own.
4) Cost of money With rolling-your-own, you're going to be doing CAPEX and OPEX. How much upfront and ongoing CAPEX can the startup handle? Would the depreciation on storage assets be helpful financially? You really need to talk to the CPA/finance person before this. There may be better tax and financial benefits by staying on S3 (OPEX). Or not.
Good luck.
Maintaining such a (storage) cluster requires 1-2 people on site who replace a few hard disks every day.
Nevertheless, if I continuously needed massive amounts of data, I would opt to do it myself any time instead of using cloud services. I just know how well these clusters run, and there is little to no saving in outsourcing it.
This allows you to read the data into AWS instances at no cost and process it as needed since there is 0 cost for ingress into AWS. I have some experience with this (hosting using Equinix)
How are you storing this data? Is it tons of small objects, or a smaller number of massive objects?
If you can aggregate the small objects into larger ones, can you compress them? Is this 10PB compressed or not? If this is video or photo data, compression won't buy you nearly as much. If you have to access small bits of data, and this data isn't something like Parquet or JSON, S3 won't be a good fit.
Will you access this data for analytics purposes? If so, S3 has querying functionality like Athena and S3 Select. If it's instead for serving small files, S3 may not be a good fit.
Really, at PB scale these questions are all critically important, and any one of them completely changes the answer. There is no easy "store PB of data" architecture; you're going to need to optimize heavily for your specific use case.
Or a 648TB raw HDD storage box for ~$53k
To get that up to raw 10 PB, I need ~$2m for all-SSD, or ~$850k for all-HDD
Bake in a 2-system safety margin, and that's ~$2.3m all-SSD or ~$960k all-HDD
Run TrueNAS and ZFS on each of them ... and my overhead becomes a little bit of cross-over sysadmin/storage admin time per year and power
Say that's 1 FTE at $180k ($120k salary + 50% overhead) per year (even though actual admin time is only going to be maybe 10% of their workload - I like rounding-up for these types of approximations)
Peak cost, therefore, is ~$2.5m the first year, and ~$200k per year afterwards
And, of course, we'll want to plan for replacement systems to pop-in ... so factor-up to $250k per year in overhead (salary, benefits, taxes, power, budget for additional/replacement servers)
Using [Wasabi](https://wasabi.com/cloud-storage-pricing/#three-info), 10PB is going to run ~$62k/mo, or ~$744k per year
It's cheaper to build-vs-buy in no more than 5 years ... probably under 3
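As a sanity check on that breakeven claim, here is a hedged sketch using only the rough figures quoted above (DIY year-one and ongoing costs, Wasabi at ~$744k/yr); the all-HDD year-one number folds the ~$960k hardware into the same ~$250k/yr overhead.

```python
# Build-vs-buy breakeven sketch; every number is a rough figure from above.
wasabi_per_year = 744_000
scenarios = {
    "all-SSD": (2_500_000, 250_000),  # ~$2.5M peak first year, ~$250k/yr after
    "all-HDD": (1_200_000, 250_000),  # ~$960k hardware + first-year overhead
}
for name, (year_one, per_year) in scenarios.items():
    diy = cloud = 0
    for year in range(1, 7):
        diy += year_one if year == 1 else per_year
        cloud += wasabi_per_year
        if diy <= cloud:
            print(f"{name}: DIY is cheaper by year {year}")
            break
```

That comes out to roughly year 5 for all-SSD and year 2 for all-HDD, which matches the "no more than 5, probably under 3" conclusion above.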
Wasabi and Glacier would be my 2nd choices.
It probably depends on if you are tied at the hip to other AWS services. If you are, then you're kind of stuck. The ingress/egress traffic will kill you doing anything with that data anywhere else.
If you aren't, the major players for on-prem S3 (assuming you want to continue access the data that way) would be (in no specific order):
Cloudian
Scality
NetApp Storagegrid
Hitachi Vantara HCP
Dell/EMC ECS
There are pluses and minuses to all of them. At that capacity I would honestly avoid a roll-your-own unless you're on a shoestring budget. Any of the above will be cheaper than Amazon.
A 1U rack server attached to two JBODs (each 4U, containing 60 spinning disks), connected to the server via 4 SAS HD cables. The rack server gets 512GiB of RAM to cache reads, and an Optane drive as a persistent cache for writes. The usable storage depends on your redundancy and spare needs. But as an example, my setup (9 x 6-drive RAIDz2 vdevs plus 4 hot spares per JBOD) nets me about 450 TiB per JBOD, or 900 TiB per rack server with two JBODs.
Repeat the setup six times and it would meet your 10 PB need. Throw in a few 10Gbps links per server and have them all linked up by a switch, and you've got your own storage setup. Maybe Minio (I have no experience with it) or something like that would give you an S3 interface over the whole thing.
I bet it would come out much cheaper than AWS. But you've got to get your hands dirty a bit with systems work, and automate all the things with a tool like Ansible. Having done it, I'd say it is totally worth it at your scale.
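A quick sanity check of the ~450 TiB-per-JBOD figure above. The drive size isn't stated in the comment, so ~14 TB drives are assumed here, and real ZFS overhead (metadata, slop space) will shave off a bit more.

```python
# RAIDz2 usable-capacity estimate for one 60-bay JBOD (drive size assumed).
vdevs = 9
vdev_width = 6            # 6-wide RAIDz2 = 4 data + 2 parity drives per vdev
parity_per_vdev = 2
hot_spares = 4
drive_tb = 14             # assumption: marketed (decimal) terabytes per drive

data_drives = vdevs * (vdev_width - parity_per_vdev)       # 36 data drives
usable_tib = data_drives * drive_tb * 1e12 / 2**40         # ~458 TiB
drives_used = vdevs * vdev_width + hot_spares              # 58 of 60 bays
print(f"{drives_used} bays used, ~{usable_tib:.0f} TiB usable per JBOD")
```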
We've recently switched to a setup with several Synology boxes for around 1PB net storage.
Is the data cold storage, that is rarely accessed? Is it OK to risk losing a percentage of it? Can you identify that percentage? If it's actively utilized, is it all used, or just a subset? Which subset? How much data is added every day? How much is deleted? What are the I/O patterns?
Etc.
I have direct experience moving big cloud datasets to on-site storage (in my case, RAID arrays), but it was a situation where the data had a long-tail usage pattern, and it didn't really matter if some was lost. YMMV.
If you're looking for a partner/consultant to get things going, feel free to reach out! This stuff is sort of our wheelhouse; my co-founder and I were previously Ops at Imgur, so you can imagine the kinds of image hosting problems we've seen :P
The short story is, ignore most of the advice, poach^H^H^H^H^Hhire someone who has done this, and leverage their expertise. There is no armchair quarterbacking infrastructure at this scale.
An example is the Backblaze Storage Pod 6.0: according to them it holds 0.5PB at a cost of about $10k, so you'd need about 20 x $10k = $200k plus maintenance (they also publish failure rates). The schematics and everything are on their website, and according to them they already have a supplier who builds these devices for them, which you could probably buy from. Note: this was published in 2016; they probably have a Pod 7.0 by now, so the cost may be better.
Reference: https://www.backblaze.com/blog/open-source-data-storage-serv...
If it's the former, then investing in-house might make sense (a la Dropbox's reverse course).
Since we're talking about images and videos, do you already have different quality versions of each media item available? Maybe thumbnail, high quality, and full quality. That could allow you to use cold storage for the full-quality media, serving the high-quality version while waiting for retrieval.
If the use case is more of a backup/restore service and a restore typically takes longer than a cold storage retrieval (being Glacier or self hosted tape robot), then keep just enough in S3 to restore while you wait for the retrieval of the rest.
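For the S3/Glacier variant of that idea, a minimal boto3 sketch (bucket and key names are placeholders): kick off a cheap bulk restore of the cold original while the app keeps serving the warm rendition.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to thaw the archived original; Bulk is the cheapest tier and takes hours.
s3.restore_object(
    Bucket="your-media-bucket",
    Key="originals/video-1234.mp4",
    RestoreRequest={
        "Days": 2,                                 # how long the restored copy stays readable
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)
# Meanwhile, keep serving renditions/video-1234-1080p.mp4 from a warm tier.
```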
If you go the self-hosted route, I like software that is flexible around hardware failures. Something that will rebalance automatically and reduce the total capacity of the cluster, rather than require you to swap the drive ASAP. That way you can batch all the hardware swapping/RMA once per week/month/quarter.
https://www.ebay.com/itm/313012077673
If it's all archival storage then it's pretty straightforward. If you're on GCP you take it all and dump it into archival single-region DRA (Durable Reduced Availability) storage for the lowest costs.
Otherwise, identify your segments and figure out a strategy for "load balancing" between the standard, nearline, coldline, and archive storage classes. If you can figure out a chronological pattern, you can write a small script that uses gsutil's built-in rsync feature to mirror data from a higher-grade storage class to a lower one at the right time.
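Alternatively (and this is just a sketch, not what the commenter described), GCS can demote objects natively with a lifecycle config instead of a scripted rsync. The ages below are placeholders, and the file is applied with `gsutil lifecycle set rules.json gs://your-bucket`.

```python
import json

# Age-based demotion rules; tune the thresholds to your access pattern.
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 365}},
    ]
}

with open("rules.json", "w") as f:
    json.dump(lifecycle, f, indent=2)
```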
The strategy will probably be similar with any of the other big 3 providers as well, but fair warning: some providers' archival-grade storage does not have immediate availability, last I checked.
10PB seems like a lot to store in S3 buckets. I assume much of that data is not accessed frequently or would be used in a big data scenario. Maybe some other services like Glacier or RedShift (I think).
Consider looking at Nutanix - you can get the hardware from HPE (including Apollo).
Object storage from Nutanix doesn’t even break a sweat at 10PB of usable storage.
However, the main reasons to look at Nutanix would be ease of use for day 0 (bootstrapping), day 1 (administrative operations, capacity management), fault tolerance, and day n operations (upgrades, security patches, etc).
Nutanix spends considerable time and resources on all of this to make life easy for our customers.
2. As a general-purpose alternative, I would use Backblaze. It's cheap and they know what they're doing. Here is a comparison of (non-personal) cloud vendor storage prices: https://gist.github.com/peterwwillis/83a4636476f01852dc2b670...
3. You need to know how the architecture impacts the storage costs. There are costs for incoming traffic, outgoing traffic, intra-zone traffic, storage, archival, and 'access' (cost per GET, POST, etc). You may end up paying $500K a month just to serve files smaller than 1KB (see the sketch after this list).
4. You need to match up availability and performance requirements against providers' guarantees, and then measure a real-world performance test over a month. Some providers enforce rate limits, with others you might be in a shared pool of rate limits.
5. You need to verify the logistics for backup and restore. For 10PB you're gonna need an option to mail physical drives/tapes. Ensure that process works if you want to keep the data around.
6. Don't become your own storage provider. Unless you have a ton of time and money and engineering talent to waste and don't want to ship a reliable product soon.
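On point 3, a rough sketch of how tiny objects shift the bill from bandwidth to per-request fees. The ~$0.0004 per 1,000 GETs and ~$0.09/GB figures are approximate S3 list prices, so check current pricing before leaning on them.

```python
# Serving a trillion 1 KB objects a month: requests dominate, not bandwidth.
gets_per_month = 1_000_000_000_000
object_size_kb = 1

request_cost = gets_per_month / 1_000 * 0.0004            # ~$400k/month
egress_gb = gets_per_month * object_size_kb / 1_000_000   # ~1 PB out
egress_cost = egress_gb * 0.09                            # ~$90k/month
print(f"requests ~${request_cost:,.0f}/mo, egress ~${egress_cost:,.0f}/mo")
```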
Try being intentional and smart in front of your data pipeline and purge data that is not useful. Too many times people store data "just in case" and that case never happens years later.
While there's definitely a cross-over point where you should roll your own, the overhead costs of running a storage cluster reliably (and all the problems you don't really have to deal with because they're outsourced to AWS) mean it might be a better use of time and effort to see how much you can cut that number down by changing the parameters of your storage. The immediate savings will be much easier to justify.
Keep in mind you've also got a migration problem: getting 10PB off Amazon is not a simple, handsfree project.
> downloaded in the background to a mobile phone
and
> but less than 1000ms required
I'm struggling to think of what kind of application needs data access in the background with latency of less than 1000ms. That would normally be for interactive use of some kind.
Getting to a 1-minute access time would get you into S3 Glacier territory... you will obviously have considered this, but I feel like some really hard scrutiny of requirements could be critical here. With intelligent tiering and smart software you might make a near order-of-magnitude difference in cost and lose almost no user-perceptible functionality.
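A hedged sketch of what that tiering could look like in practice: a lifecycle rule (the bucket name, prefix, and 30-day threshold are assumptions) that demotes originals to a colder class while the app serves smaller derived copies from a warm tier.

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="your-media-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "demote-originals",
            "Status": "Enabled",
            "Filter": {"Prefix": "originals/"},
            # INTELLIGENT_TIERING or STANDARD_IA are gentler first steps than GLACIER.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```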
> The data is all images and videos, and no queries need to be performed on the data.
Okay, this is a good start, but there are some other important factors.
For every PB of data, how much bandwidth is used in a month, and what percentage of the data is actually accessed?
Annoyingly, the services that have the best warm/"cold" storage offerings also tend to be the services that overcharge the most for bandwidth.
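To put rough numbers on that question (the ~$0.09/GB is an approximate big-cloud internet-egress list price, and the 5% monthly access rate is a made-up knob to play with):

```python
# Back-of-envelope bandwidth bill for a 10 PB corpus.
stored_pb = 10
accessed_fraction_per_month = 0.05                  # assume 5% of the data is read out monthly
egress_gb = stored_pb * 1_000_000 * accessed_fraction_per_month
egress_cost = egress_gb * 0.09                      # ~$0.09/GB internet egress
print(f"~{egress_gb / 1e6:.1f} PB out/month -> ~${egress_cost:,.0f}/month in bandwidth alone")
```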
Plus, here we are not talking about 10PB but probably 25PB given redundancy, and probably 100PB and more given the assumption that your company is growing. So a solution that costs slightly less today but whose cost will only do 2x when you do 10x would still be very interesting, IMO. There is a lot to talk about ;)
Kinda disappointed that the file solutions seem more complicated, with nothing as simple to set up as some of the newer databases like CockroachDB or MongoDB are to use. I feel like reinventing the wheel is kinda bad; I'd rather let people who are experts in this field handle this stuff, but I hate the idea of vendor lock-in and being forced to use other people's servers. Self-hosting would be nice, from a single node for testing up to a cluster spanning multiple datacenters. Maybe there's a solution out there; I've done some searching and just seem to go in circles. I saw one system, but if you wanted to add or remove nodes in the future, you couldn't just "drain" a chunk server by moving its data off.
However, you'll get to a point where it's crucial to become profitable. And storing that much data does cost a lot of money using one of the mentioned providers.
So, when you think it's the right time to become "mature", get your own servers up and running using colocation.
What options do you have here (just a quick brainstorm):
1. Set up some servers, put in a lot of hard drives, format them using ZFS, and make them available over NFS on your network
2. Get some storage servers
3. Set up a Ceph cluster
I used to work as a CTO at a hosting company and evaluated all of these options and more. Each of these options comes with pros and cons.
Just one last piece of advice: evaluate your options and get some external help on this. Any of these options has pitfalls, and you need experienced consultants to set up and run such an infrastructure.
All in all, it's an investment that will save you a lot of money and will give you the freedom and flexibility to grow further.
P.S. we ended up setting up a Ceph cluster. We found a partner who specializes in hosting custom infrastructure. That partner is responsible for all the maintenance, so we could focus on the product itself.
If you're not afraid of having a few operations people on staff and running a few racks in multiple data centers, then buy a bunch of drives and servers and install something to expose everything via S3 interface (Ceph, Minio, ...) so none of your tools have to change.
Way cheaper than AWS, and a lot less headache than trying to run it all yourself.
Surprised I didn't see Gluster in this thread already. Maybe it's not for such a big scale?
edit: Wikipedia says GlusterFS can "scale up to several petabytes on commodity hardware".
Wasabi's Reserved Capacity Storage is likely to be the cheapest: https://wasabi.com/rcs/
If you front it with Cloudflare, egress would be close to free given both these companies are part of the Bandwidth Alliance: https://www.cloudflare.com/bandwidth-alliance/
Cloudflare has an images product in closed beta, but that is likely unnecessary and probably expensive for your usecase: https://blog.cloudflare.com/announcing-cloudflare-images-bet...
--
If you're curious still, take a look at Facebook's F4 (generic blob store) and Haystack (for IO bound image workloads) designs: https://archive.is/49GUM
What are your access patterns? You say "no queries need to be performed," but are you accessing via key-value look-ups? Or ranged look-ups?
What do customers do with the pictures? Do customers browse through images and videos?
You mention it's "user generated data" - how many users (order of magnitude)? How often is new data generated? Does the dataset grow, or can you evict older images/videos (so you have a moving window of data through time)?
Besides your immediate needs, what other needs do you anticipate? (Will you need to do ML/Analytics work on the data in the future? Will you want to generate thumbnails from the existing data set?)
What my experience is based on: I was formerly Senior Software Engineer/Principal Engineer for a team that managed reporting tools for internal reporting of Amazon's Retail data. The team I was on provides tools for accessing several years worth of Amazon.com's order/shipment data.
I'd recommend reaching out to some data eng in the various Bigs, they certainly have more clear numbers. Happy to make an intro if you need, feel free to dm me.
That said, if you really think you must, spend effort on good deduping/transcoding (relatively easy with images/video), and consider some far lower-cost storage options than S3, which is pretty pricey no matter what you do. If S3 is a good fit, I hear good things about Wasabi, but haven't used it myself.
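A minimal sketch of the exact-duplicate half of that dedup effort (the paths are illustrative; perceptual near-duplicate detection for images and re-transcoding video are separate, bigger projects):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file so multi-GB videos don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Group files by content hash; anything beyond the first copy is reclaimable.
groups = defaultdict(list)
for p in Path("media").rglob("*"):
    if p.is_file():
        groups[sha256_of(p)].append(p)

reclaimable = sum(p.stat().st_size for paths in groups.values() for p in paths[1:])
print(f"exact-duplicate bytes reclaimable: {reclaimable / 1e9:.1f} GB")
```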
If you have the technical ability (non-trivial, you need someone who really understands, disk and system I/O, RAID Controllers, PCI lane optimization, SAN protocols and network performance (not just IP), etc.) and the wherewithal to invest, then putting this on good hardware with something like say, ZFS at your site or a good co-lo will be WAY cheaper and probably offer higher performance than any other option, especially combined with serious deduping. (Look carefully at everything that comes in once and you never have to do it again.) Also, keep in mind that even-numbered RAID levels can make more sense for video streaming, if that's a big part of the mix.
The MAIN thing: Keep in mind that understanding your data flows is way more important than just "designing for scale". And really try to not need so much data in the first place.
(Aside: I was cofounder and chief technologist of one of the first onsite storage service providers - we built a screamer of a storage system that was 3-4x as fast, and scaled 10x larger, than IBM's fastest Shark array, at less than 10% of the cost. The bad news: we were planning to launch the week of 9/11 and, as a self-funded company, ran out of money before the economy came back. The system kicked ass, though.)
For example, at 10PB with every object stored twice (so 20 PB raw storage), you'd need ~90 of their SX293[1] boxes, coming out to around €30k/mo. This doesn't include the time to configure/maintain on your end, but it does cover any costs associated with replacing failed drives.
I’ve done similar setups for cheap video storage & CDN origin systems before, and it’s worked fairly well if you’re cost conscious.
[1] https://www.hetzner.com/dedicated-rootserver/sx293/configura...
* network access - do you have data that will be accessed frequently, and with high traffic? You need to cover this skewed access pattern in your solution.
* data migration from one node to another, etc...
* ability to restore quickly in case of failure.
I would suggest to:
* use some open-source solution on top of the hosted infrastructure (Hetzner or similar is a good choice)
* bring in a seasoned expert to analyze your data usage/storage patterns; maybe there are some other ways to make storage more cost-effective than simply moving out of AWS S3.
Disclaimer: I'm working at a consultancy/partner for a competing cloud.
https://www.backblaze.com/blog/open-source-data-storage-serv...
Their Storage Pod 6.0 can hold up to 480TB per server.
And it already supports the S3 API, as well as HTTP, FUSE, WebDAV, Hadoop, etc.
There should be many existing hardware options that are much cheaper than AWS S3.
Its API is S3-compliant.
Also, I believe they have minimal costs for transferring data from S3 into Wasabi, so the initial setup cost should be lower too.
It should be relatively cheaper than self-hosting too, when you account for the hidden costs that come with self-hosting: managing additional employees, having protocols in place for recovering from faults, expanding the storage as you go, maintaining existing infrastructure, etc.
You can compare the prices with respect to S3 at
Can you afford the up-front costs of the hardware needed to run the solutions you may want to run?
Will those solutions have good enough data locality to be useful to you?
It isn't really useful to have all your data on-site and your operations in the cloud. You've introduced many new layers that can fail.
If you go on-prem, the solution to look at is likely Ceph.
Source: Storage Software Engineer, who has spoken at SNIA SDC. I currently maintain a "small" 1PB ceph cluster at work.
Recommendation: Get someone who knows storage and systems engineering to work with you on the project. Even if you decide not to move, understanding why is the most important part.
- Paying for physical space and facilities
- Paying people to maintain it
- Paying for DRP/BCP
- Paying periodically since it doesn't last forever so it'll need replacements
But if you have to move out of AWS and Azure and GCP aren't options, you can do Ceph and HDDs. Dual copies of files, so you have to lose three drives for any specific file to suffer data loss (and only those files). It does not come with versioning, full IAM-style access control, or web servers for static files (which you get 'for free' with S3).
HDDs don't need to be in servers, they can be in drive racks, connected with SAS or iSCSI to servers. This means you only need a few nodes to control many harddisks.
A more integrated option would be (as suggested) Backblaze pod-style enclosures, or Storinator-type top-loaders (Supermicro has those too). These are generally 4U rack units for 40 to 60 3.5" drives, which again generally comes to about 1PB per 4U. A 48U rack holds 11 units when using side-mounted PDUs, a single top-of-rack switch, and no environmental monitoring in the rack (and no electronic access control - no space!).
This means that for redundancy you'd need 3 racks of 10 units. If availability isn't a problem (1 rack down == entire service down) you can do 1 rack. If availability is important enough that you don't want downtime for maintenance, you need at least 2 racks. Cost will be about $510k per rack. Lifetime is about 5 to 6 years, but you'll have to replace dead drives almost every day at that volume, which means an additional ~2000 drives over the lifespan; perhaps some RAM will fail too, and maybe one or two HBAs, NICs, and a few SFPs. That's about $1,500,000 in spare parts over the life of the hardware, not including the racks themselves, and not including power, cooling, or the physical facilities to house them.
Note: all of the figures above are 'prosumer' class and semi-DIY. There are vendors that will support you partially, but that is an additional cost.
I'm probably repeating myself (and others) here, but unless you happen to already have most of this (say: the people, skills, experience, knowledge, facilities, money upfront and money during its lifecycle), this is a bad idea and 10PB isn't nearly enough to do by yourself 'for cheaper'. You'd have to get into the 100PB or more arena to 'start' with this stuff if you need to get all of those externalities covered as well (unless it happens to be your core business, which from the opening post it doesn't seem to be).
A rough S3 One Zone IA calculation shows a worst-case cost of about $150,000 monthly, but at that rate you can negotiate a lot of cost savings, and with some smart lifecycle configuration you can get it down further. In the do-it-yourself vs. letting-AWS-do-it comparison, that can make AWS roughly half as expensive as that worst case.
Calculation as follows:
DIY: at least 3 racks to match AWS One Zone IA (you'd need 3 racks in 3 different locations, 9 racks in total, to match 3 zones, but we're not doing that, as per your request). That means the initial starting cost is a minimum of $1,530,000, combined with a lifetime cost of at least $1,500,000 over 5 years if we're lucky, so about $606,000 per year, just for the contents of racks that you already have to have.
Adding to this, you'd have some average colocation costs, no matter whether you have an entire room, a private cage, or a shared corridor. That's at least 160U, and in total at least 1400VA per 4U (or roughly 14A at 120V). That amount of power is what a third of a normal rack might use on its own! Roughly, that will boil down to a monthly racking cost of $1,300 per 4U if you use one of those colocation facilities. That's another ~$45k per month, at the very least.
So a no-personnel, colocated setup can be done, but doing all that stuff 'externally' is expensive: about $95,500 every month, with no scalability, no real security, no web services or load balancing, etc.
That means below-par features get you a rough saving of $50k monthly, provided you don't need any personnel and nothing breaks 'more' than usual. And you'd have to not use any other S3 features besides storage. And if you use anything outside of the datacenter where you're located (i.e. if you host an app in AWS EC2, ECS, or a Lambda or something) and you need a reasonable pipe between your storage and the app, that's a couple of K's per month you can add, eating into the perceived savings.
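Consolidating the numbers above into one back-of-envelope comparison (all figures are the rough ones from this comment, so treat the output as order-of-magnitude only):

```python
# DIY colocation vs. S3 One Zone IA list price, per month over five years.
hardware_upfront = 1_530_000   # three racks of ~$510k each
lifetime_spares = 1_500_000    # drives, HBAs, NICs, SFPs over ~5 years
colo_per_month = 45_000        # ~$1,300 per 4U across the deployment
months = 5 * 12

diy_per_month = (hardware_upfront + lifetime_spares) / months + colo_per_month
s3_list_per_month = 150_000    # worst-case S3 One Zone IA figure from above
print(f"DIY ~${diy_per_month:,.0f}/mo vs S3 One Zone IA list ~${s3_list_per_month:,.0f}/mo")
# ~$95,500/mo DIY, i.e. roughly $50k/mo below S3 list price -- before personnel.
```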
If you're storing images and videos directly from the phone, they can be downsampled drastically without losing quality on a viewing device that anyone's likely to have.
It's unlikely that anyone wants to download the full size copy, and if they do, they can wait a few hours for Glacier.
You could expose this to the customer, e.g. offer direct access of originals at 2x or 5x the price. But 99.9% of people will be OK with immediate access to quality images/video and eventual access to the unmodified originals.
I personally would consider S3 Glacier+CloudFront, a member of the Bandwidth Alliance [0] of your choice+Cloudflare, and whoever serves TikTok now.
Hiring staff to build this would make sense at this point, because if your S3 storage cost is really $200,000/month, you can hire 3 good engineers for $450,000/year, which is roughly the cost of just two months of S3 storage.
As an aside, you can often get nice credits for moving off of AWS to Azure or GCP. I recommend the latter.
We store north of 2PB with AWS and have just committed to an agreement that will increase that commitment based on some competitive pricing they've given us.
Give me a shout if you'd like to chat.
As for my own storage, I use 1TB SanDisk SD cards in a Raspberry Pi 2 cluster for write-once (user) data, and 8x64GB 50nm SATA drives from 2011 on a 2x Atom 8-core for data that changes all the time! Xo
People say that content is king, I think that final technology (systems that don't need rewriting ever) is king and content has peaked! ;)
Also, this is a startup, no? A million or so in storage so you need not preoccupy your startup with dealing with failing disks, disk provisioning, colocation costs, etc. etc., not to mention the 11 9s of durability you get with S3; to me it just makes the most sense to do this in the cloud.
Changing that can be very very difficult for not much gain. Plus AWS skills are very easy to recruit for vs Google cloud.
There is an S3-compatible interface, so you may just need to change the access key and region host: https://www.backblaze.com/b2/docs/s3_compatible_api.html
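In boto3 terms, "change the access key and region host" is roughly this (the endpoint follows B2's documented S3-compatible format, but the region, credentials, and bucket here are placeholders):

```python
import boto3

b2 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",  # your B2 region's endpoint
    aws_access_key_id="YOUR_B2_KEY_ID",
    aws_secret_access_key="YOUR_B2_APPLICATION_KEY",
)
# Existing S3-style code keeps working against the new endpoint.
b2.upload_file("photo.jpg", "your-bucket", "originals/photo.jpg")
```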
We ended up buying lots of Supermicro's ultra dense servers [1]. That's a 3U box, containing 24 servers that are interconnected with internal switches (think: 1 box is a self-contained mini cloud). Each server has (cheap config) 1 CPU 4 Xeon cores, 32GB ram, 4TB disk.
Those were bought & hosted in China, and IIRC price tag was around $20k USD per box. That's 96TB per 3U, or >1.2PB and ~$200k per rack. We had a lot of racks in multiple datacenters. These days capacity can be much larger, e.g.: 6TB disk, 144TB per 3U and >1.8PB per rack.
We've tried Ceph, GlusterFS, HDFS, even early versions of Citus, and pretty much everything that existed and was maintained at that time. We eventually settled on Cassandra. It required 2 people to maintain the software, and 1 for the hardware.
Today, I would do the same hardware setup, mainly because I haven't had a single Supermicro component fail on me since I first bought them in the early 2000s. Cassandra would be replaced by FoundationDB. I've been using FoundationDB for a while now, and it just works: zero maintenance, incredible speeds, multi-datacenter replication, etc.
Alternatively, if I needed storage without processing, but with fast access, I'd probably go with Supermicro's 4U 90-bay pods [2]. That'd be 90 x 16TB, 1.4PB in 4U, or ~14PB per rack. And FoundationDB, no doubt.
As a fun aside: back then, we also tried Kinetic Ethernet Attached Storage [3]. Great idea but what a pain in the rear it was. We did however have a very early access device. No idea if it's still in production or not.
[1] https://www.supermicro.com/en/products/system/3U/5038/SYS-50...
[2] https://www.supermicro.com/en/products/system/4U/6048/SSG-60...
[3] https://www.supermicro.com/products/nfo/files/storage/d_SSG-...
For online or nearline storage, you should look at what Backblaze did. Either buy hardware that is similar to what they did (basically disk shelves, you can cram ~100 drives into a 4U chassis) or if you are at that scale you can probably build your own just like they did.
Chances are you don't need all of it. Every company today thinks they need "Big Data" to do their theoretical magic machine learning, but most of them are wrong. Hoarding petabytes of worthless data doesn't make you Facebook.
To be a little less glib, I'd start by auditing how much of that 10PB actually matters to anyone.
(I'm not working there anymore, posting this just to help)
Because you must be able to deal with Ceph quirks.
If you can shard your data over multiple independent stand-alone ZFS boxes, that would be much simpler and more robust. But it might not scale like Ceph.
The only issue is whether or not you have a CDN in front of this data. If you do, then Backblaze might not be much cheaper than S3 -> CloudFront. You'd save storage costs but easily exceed those savings in egress.
I only read about it, but never used it.
It advertises itself as exabyte-scalable and provides S3 and NFS access.
http://web.archive.org/web/20201128103953/https://blog.ampli...
They're basically 100% S3-compatible.
I don't know the details of their pricing, but they're production-grade in the real sense of the word.
I am not affiliated with them in any way, but I interviewed with them a couple of years ago and left with a good impression.
Everyone that wants to make extra money can join
You join with your computer hooked up to internet, a piece of software running in background
You share a % of your hard drive and limit the speed that can be used to upload/download
When someone needs to store 100PB of data ("uploader"), they submit a "contract" on a blockchain - they also set the redundancy rate, meaning how many copies need to be spread out to guarantee consistency of the data as a whole
The "uploader" shares a file - the file is chopped into chunks and each chunk is encrypted with the uploader's private PGP key. The info about the chunks is uploaded to the blockchain and everyone gets a piece. In return, all parties that keep a piece of the uploader's data get paid a small %, either via PayPal or simply in crypto.
I think that would be a cool project, but someone would have to do back-of-napkin number crunching if that would be profitable enough to data hoarders :)
Not worth the risk or why?
(In 1998, in school, I looked up in our math book what would come after mega, giga... 20 years later, just as fresh and useless as on day one ;))
Perhaps it’s a mix of some app pattern changes and leveraging the storage tier options in AWS to reduce your cost.
Escherichia coli, for instance, has a storage density of about 10^19 bits per cubic centimeter. At that density, all the world's current storage needs for a year could be well met by a cube of DNA measuring about one meter on a side.
There are several companies doing it: https://www.scientificamerican.com/article/dna-data-storage-...
https://file.app/ https://docs.filecoin.io/build/powergate/
(Disclosure: I am indirectly connected to filecoin, but interested in genuine answers)
cat >/dev/null, obviously. ;-)
How often you access data is another question.
Feel free to ama on it, I'm a huge fan
If you place a high value on engineering velocity and you already rely on managed services, then I would look to stay in S3. Do the legwork to gather competitive bids (GCS, Azure, maybe one second tier option) and use that in your price negotiation. Negotiation is a skill, so depending on the experience in your team, you may have better or worse results -- but it should be possible to get some traction if you engage in good faith with AWS.
There is a considerable opportunity cost to moving that data to another cloud provider. No matter how well you plan and execute it, you're going to lose some amount of velocity for at least several months. In a worse scenario, you are running two parallel systems for a considerable amount of time and have to pay that overhead cost on your engineering team's productivity. In the worst case scenario, you experience service degradation or even lose customer data. It's quite easy for 2-3 months to turn into 2-3 years when other higher priority requirements appear, and it's also easy for unknowns to pop up and complicate your migration.
With all of that said, if the fully baked cost of migrating to another cloud provider (engineering time + temporary migration services + a period of duplicated costs between services + opportunity cost) is trajectory changing for your business, then it certainly can be done. I feel like GCS is a bit better of a product vs S3, although S3 has managed to iron out some of its legacy cruft in the last few years. Azure is not my cup of tea. I have never seriously considered any other vendors in the space, although there are many.
Your other option is to build it. I've done it several times, people do it every day. You may need someone on the team who either has or can grow the skillset you're going to need: vendor negotiation, capacity planning, hardware qualification, and other operational tasks. You can save a bunch of money, but the opportunity cost can be even greater.
10PB is the equivalent of maybe 1-2 racks of servers in a world where you can easily get 40-50 drive systems with 10-18TB drives (of course, for redundancy you would need more like 2-2.5x, and you need space to grow into so that you're always ahead of your user growth curve). At any rate, my point is that the deployment isn't particularly large, so you aren't going to see good economies of scale. If you expect to be in the 100+PB range in 6-12 months, this could still be the right option.
Personally, I would look to build a service like this in S3 and migrate to on-premise at an inflection point probably 2 orders of magnitude above yours, if the future growth curve dictated it. The migration time and cost will be even more onerous, but the flexibility while finding product/market fit probably countermands the cost overhead.
There is a third option, which is hosted storage where someone else runs the machines for you. Personally I see it as a stop-gap solution on the path to running the machines yourself, and so it's not very exciting. But it is a way to minimize your investment before fully committing.
1. Do you have paying customers already?
2. Can the startup weather large capex? Does opex work better for you?
3. Do you already have staff with sufficient bandwidth to support this, or will you need to hire?
4. What are the access patterns for the data?
5. What is the data growth rate?
6. What is the cost of losing some, or all of this data?
7. What is your expected ROI?
TL;DR - storing and serving up the data is the easy part.
I have no idea how you evaluate the necessity of keeping the data safe, and that plays a huge factor in deciding what's appropriate. Amazon S3 makes it a no-brainer for having your data safe across failure domains. Of course, the same can be done with non-S3 solutions, but someone has to set it all up, test it, and pay for it.
My background in storage is mostly related to working with Ceph and Swift (both OpenStack Swift and SwiftStack) while being employed by various hardware vendors.
Some thoughts on Ceph:
- In my opinion, Ceph is better suited for block storage than object storage. To be fair, it does support object storage with the use of the Rados Gateway (RGW), and RGW does support the S3 API. However, Ceph has a strong consistency model, and in my opinion strong consistency tends to be better suited to block storage. Why is this? For a 10PB cluster (or larger), failures of various types will be the norm (mostly disk failures). What does Ceph do when a disk fails? It goes to work right away to move whatever data was on the failed disk (using its redundant copies/fragments) to a new place. No big deal if it's only a single HDD that's in failed status at any given point in time. What if you have a server, disk controller, or drive shelf fail? You get a whole bunch of data backfilling going on all at once. The other consideration with the strong consistency model is multi-site storage, which is not so good for strong consistency (due to higher latency for inter-site communication).
- Ceph has a ton of knobs, is very feature rich, and is high on complexity (although it has improved). The open-source mechanisms for installing it and the admin tools have experienced (and continue to have) a high rate of churn. Do a quick search on how to install/deploy Ceph and you'll see multiple approaches; same with admin tools. Should you strongly consider Ceph as an option, I would strongly advise you to license and use one of the 3rd-party software suites that (a) take the pain away from install/deploy/admin, and (b) reduce the amount of deep expertise that you would need to keep it running successfully. Examples of these 3rd-party Ceph admin suites are Croit [0] and OSNEXUS [1]. Alternatively, if you like the idea of a Ceph appliance, I would take a close look at SoftIron [2].
Aside from Ceph, it's worth taking a very close look at OpenStack Swift [3][4]. It's only object storage and has been around for about 10 years. It supports the S3 protocol and also has its own Swift protocol. It's open source and it has an eventually consistent data model. Eventually consistent is (IMO) a much better fit for a 10+PB cluster of objects. Why is this? Because failures can be handled with less urgency and at more opportune times. Additionally, an eventually consistent model makes multi-site storage MUCH easier to deal with.
I suggest going further and spending some quality time with the folks at SwiftStack [5]. Object storage is their game and they're very good at it. They can also help with on-prem vs hosted vs hybrid deployments.
Additionally, you would definitely want to use erasure coding (EC) as opposed to full replication. This is easy enough to do with either Swift or Ceph.
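A quick illustration of why that matters at this scale (the 8+3 profile is just an example; both Swift EC policies and Ceph EC profiles let you pick k and m):

```python
# Raw capacity needed for 10 PB usable: 3x replication vs. 8+3 erasure coding.
usable_pb = 10
replicas = 3                 # typical full-replication factor
k, m = 8, 3                  # EC: k data fragments + m parity fragments

replicated_raw = usable_pb * replicas        # 30 PB raw
ec_raw = usable_pb * (k + m) / k             # 13.75 PB raw
print(f"3x replication: {replicated_raw} PB raw, {k}+{m} EC: {ec_raw:.2f} PB raw")
```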
Disclaimers and disclosures: I am not currently (nor have I ever been) employed by any of the companies I mentioned above.
Dell EMC Technical Lead and co-author of these documents:
Dell EMC Ready Architecture for Red Hat Ceph Storage 3.2 - Object Storage Architecture [6]
Dell EMC Ready Architecture for SwiftStack Storage - Object Storage Architecture Guide [7]
Intel co-author of this document: "Accelerating Swift with Intel Cache Acceleration Software" [8]
[0] https://croit.io
[1] https://www.osnexus.com/technology/ceph
[2] https://softiron.com
[3] https://wiki.openstack.org/wiki/Swift
[4] https://github.com/openstack/swift
[5] https://www.swiftstack.com
[6] https://www.delltechnologies.com/resources/en-us/asset/technical-guides-support-information/solutions/red_hat_ceph_storage_v3-2_object_storage_architecture_guide.pdf
[7] https://infohub.delltechnologies.com/section-assets/solution-brief-swiftstack-1
[8] https://www.intel.sg/content/www/xa/en/software/intel-cache-acceleration-software-performance/intel-cache-acceleration-software-performance-accelerating-swift-white-paper.html
If this isn't already something that your company is familiar with, you'll need people who know how to buy, build, test and manage infrastructure across datacentres, including servers and core networking. Understanding platforms like Linux will be critical, as well as monitoring and logging solutions (perhaps like Prometheus and Elastic).
The only solution that I know of which would scale to your requirements would be OpenStack Swift (https://wiki.openstack.org/wiki/Swift). It's explicitly designed as an eventually consistent object store which makes it great for multi-region, and it scales. It is Apache 2.0 licensed, written in Python with a simple REST API (plus support for S3).
The Swift architecture is pretty simple. It has 4 roles (proxy, account, container, and object) which you can mix and match on your nodes and scale independently. The proxy nodes handle all your incoming traffic, like retrieving data from clients and sending it on to the object nodes and vice versa. Proxy nodes can be addressed independently rather than through a load balancer, which is one of the ways Swift is able to scale out so well. You could start with three and go up to dozens across regions, as required.
The object nodes are pretty simple, they are also Linux machines with a bunch of disks each formatted with a simple XFS file system where they read and write data. Whole files are stored on disk but very large files can be sharded automatically and spread across multiple nodes. You can use replication or erasure coding and the data is scrubbed continuously, so if there is a corrupt object it will be replaced automatically.
Data is automatically kept on different nodes to avoid loss for when a node dies, in which case new copies of the data are made automatically from existing nodes. You can also configure regions and zones to help determine the placement of data across the wider cluster. For example, you could say you want at least one copy of an object per datacentre.
I know that many large companies use Swift and I've personally designed and built large clusters of over 100 nodes (with the SwiftStack product) across three datacentres. This gives us three regions (although we mostly use two) and we have a few different DNS entries as entry points into the cluster. For example, we have one at swift.domain.com which resolves to 12 proxy nodes across each region, then others which resolve to proxy nodes in one region only, e.g. swift-dc1.domain.com. This way users can go to a specific region if they want to, or just the wider cluster in general.
We used Linux on commodity hardware, stock 2RU HPE servers with 12 x 12 TB drives (so total cluster size is ~14PB raw), but I'm sure there's a better sweet spot out there. You could also create different node types, higher density or faster disk as required, perhaps even an "archive" tier. NVMe is ideal for the account and container services; the rest can be regular SATA/NL-SAS. You want each drive to be addressed individually, so no multi-disk RAID arrays; however, each of our drives sits on its own single-member RAID-0 array in order to make use of some caching from the RAID controller (so 12 x RAID-0 arrays per object node).
Our cluster nodes connect to Cisco spine and leaf networking and have multiple networks; e.g. the routeable frontend network for accessing the proxy nodes, private cluster network for accessing objects and the replication network for sending objects around the cluster.
Ceph is another open source option, and while I love it as block storage for VMs, I'm not convinced that it's quite the right design for a large, distributed object store. Compared to Swift, the object store seems more of an afterthought and inherits a system designed for blocks. For example, it is synchronous and latency sensitive, so multi-region can be tricky. It could still be worth looking into, though.
Given the size of your data and ongoing costs of keeping it in AWS, it might be worthwhile investing in a small proof of concept with Swift (and perhaps some others). If you can successfully move your data onto your own infrastructure I'm sure you can not only save money but be in better control overall.
I've worked on upstream OpenStack and I'm sure the community would be very welcoming if you wanted to go that way. Swift is also just a really great piece of technology and I love seeing more people using it :-) Feel free to reach out if you want more details or some help, I'll be glad to do what I can.
I am not from Oracle, and I am also running a startup with growing pains. Oracle is a bit late to the cloud game, so they are loading up their customer base now, and the squeezing will come 3-5 years down the road. Maybe you can take advantage of this.