Min requirements of AWS S3 One Zone IA (https://aws.amazon.com/s3/storage-classes/?nc=sn&loc=3)
How would you store >10PB if you were in my shoes? The thought experiment can be with and without the data transfer cost out of our current S3 buckets. Please also mention what your experience is based on. Ideally you store large amounts of data yourself and can speak from first-hand experience.
Thank you for your support!! I will post a follow-up thread once we've reached a decision and know what we ended up doing.
Update: I should have mentioned earlier that the data needs to be accessible at all times. It's user-generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms is required.
The data is all images and videos, and no queries need to be performed on the data.
HPE sells their Apollo 4000[^1] line, which takes 60x 3.5" drives; with 16TB drives that's 960TB per machine, so one rack of 10 of these is 9PB+, which nearly covers your 10PB need. (We have some racks like this.) They are not cheap. (Note: Quanta makes servers that can take 108x 3.5" drives, but they need special deep racks.)
The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph, and ZFS across multiple machines is nasty as far as I'm aware, but I could be wrong. HDFS would work, but the latency can be completely random there.
[^1]: https://www.hpe.com/uk/en/storage/apollo-4000.html
So unless you are desperate to save money in the long run, stick to the cloud, and let someone else sweat about the filesystem level issues :)
EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.
EDIT2: at 10PB HDFS would be happy; buy 3 racks of those Apollos and you're done. We only started struggling at 1000+ nodes; now, with 2400 nodes, nearly 250PB raw capacity, and literally a billion filesystem objects, we are slow as f*, so plan carefully.
10PB costs more than $210,000 per month at S3, or more than $12M after five years.
RackMountPro offers a 4U server with 102 bays, similar to the Backblaze servers, which fully configured with 12TB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)
That means that you could fit all 15PB (for erasure coding with Minio) in less than two racks for around $150k up-front.
Figure another $5k/mo for monthly opex as well (power, bandwidth, etc.)
Instead of $12M spent after five years, you'd be at less than $500k, including traffic (also far cheaper than AWS.) Even if you got AWS to cut their price in half (good luck with that), you'd still be saving more than $5 million.
Getting the data out of AWS won't be cheap, but check out the snowball options for that: https://aws.amazon.com/snowball/pricing/
* To store 10+ PB of data.
* You need 15 PB of storage (running at 66% capacity)
* You need 30 PB of raw disks (twice for redundancy).
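To make the bullets above concrete, here is a rough sketch of the disk math. The 66% fill, 2x redundancy, 16TB drive size, and drive price are just the assumptions stated or implied above; swap in your own.

```python
# Rough capacity math for 10 PB usable; all inputs are assumptions to tweak.
usable_target_pb = 10      # data you actually need to keep
fill_factor = 0.66         # don't run the cluster full
redundancy = 2             # two copies of everything
drive_tb = 16              # marketed capacity per drive
drive_price_usd = 350      # rough street price, purely illustrative

provisioned_pb = usable_target_pb / fill_factor      # ~15 PB provisioned
raw_pb = provisioned_pb * redundancy                 # ~30 PB of raw disk
drives = raw_pb * 1000 / drive_tb                    # ~1900 drives
print(f"~{raw_pb:.0f} PB raw, ~{drives:.0f} drives, "
      f"~${drives * drive_price_usd / 1e6:.1f}M in disks alone")
```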
You're looking at buying thousands of large disks, on the order of a million dollars upfront. Do you have that sort of money available right now?
Maybe you do. Then, are you ready to receive and handle entire pallets of hardware? That will need to go somewhere with power and networking. They won't show up for another 3-6 months because that's the lead time to receive an order like that.
If you talk to Dell/HP/other, they can advise you and sell you large storage appliances. Problem is, the larger appliances will only host 1 or 2 PB. That's nowhere near enough.
There is a sweet spot in moving off the cloud, if you can fit your entire infrastructure into one rack. You're not in that sweet spot.
You're going to be filling multiple racks, which is a pretty serious issue in terms of logistics (space, power, upfront costs, networking).
Then you're going to have to handle "sharding" on top of the storage because there's no filesystem that can easily address 4 racks of disks. (Ceph/Lustre is another year long project for half a person).
The conclusion of this story: S3 is pretty good. Your time would be better spent optimizing the software. What is expensive? The storage, the bandwidth, or both?
* If it's the bandwidth: you need to improve your CDN and caching layer.
* If it's the storage: work on better compression for the images and videos, and check whether you can adjust retention.
But you'll need to balance the cost of finding people with that level of knowledge and adaptability against the cost of bundled storage packages. We were running super lean, got great deals on bandwidth and power, and had low performance requirements. When we ran the numbers for all-in costs, it was less than we thought we could get from any other vendor. And if you commit to buying the server racks it will take to fit 10PB, you can probably get somebody like Quanta to talk to you.
1) Staff You'll need at least one person, maybe two, to build, operate, and maintain any self-hosted solution. A quick peek at Glassdoor and Salary.com shows the unloaded salary for a Storage Engineer runs $92,000-130,000 US. Multiply by 1.25-1.4 for the loaded cost of an employee (things like FICA, insurance, laptop, facilities, etc). Storage Administrators run lower, but still around $70K US unloaded. Point is, you'll be paying around $100K+/year per storage staff position.
2) Facilities (HVAC, electrical, floor loading, etc) If you host on-site (not hosting facility), you'd better make certain your physical facilities can handle it. Can your HVAC handle the cooling, or will you need to upgrade it? What about your electrical? Can you get the increased electrical in your area? How much will your UPS and generator cost? Can the physical structure of the building (floor loading, etc) handle the weight of racks and hundreds of drives, the vibration of mechanical drives, the air cycling?
3) Disaster Recovery/Business Continuity Since you're using S3 One Zone IA, you have no multi-zone redundancy. Its use case is secondary backup storage, not the primary data store for running a startup. When there is an outage/failure (and it will happen), the startup may be toast, and investors none too happy. So this is another expense you're going to have to seriously consider, whether you stick with S3 or roll your own.
4) Cost of money With rolling-your-own, you're going to be doing CAPEX and OPEX. How much upfront and ongoing CAPEX can the startup handle? Would the depreciation on storage assets be helpful financially? You really need to talk to the CPA/finance person before this. There may be better tax and financial benefits by staying on S3 (OPEX). Or not.
Good luck.
Maintaining such a (storage) cluster requires 1-2 people on site who replace a few hard disks every day.
Nevertheless, if I continuously needed massive amounts of data, I would opt to do it myself any time instead of using cloud services. I just know how well these clusters run, and there is little to no saving in outsourcing it.
This allows you to read the data into AWS instances at no cost and process it as needed since there is 0 cost for ingress into AWS. I have some experience with this (hosting using Equinix)
How are you storing this data? Is it tons of small objects, or a smaller number of massive objects?
If you can aggregate the small objects into larger ones, can you compress them? Is this 10PB compressed or not? If this is video or photo data, compression won't buy you nearly as much. If you have to access small bits of data, and this data isn't something like Parquet or JSON, S3 won't be a good fit.
Will you access this data for analytics purposes? If so, S3 has querying functionality like Athena and S3 Select. If it's instead for serving small files, S3 may not be a good fit.
Really, at PB scale these questions are all critically important, and any one of them completely changes the answer. There is no easy "store PB of data" architecture; you're going to need to optimize heavily for your specific use case.
Or a 648TB raw HDD storage box for ~$53k
To get that up to raw 10 PB, I need ~$2m for all-SSD, or ~$850k for all-HDD
Bake in a 2-system safety margin, and that's ~$2.3m all-SSD or ~$960k all-HDD
Run TrueNAS and ZFS on each of them ... and my overhead becomes a little bit of cross-over sysadmin/storage admin time per year and power
Say that's 1 FTE at $180k ($120k salary + 50% overhead) per year (even though actual admin time is only going to be maybe 10% of their workload - I like rounding-up for these types of approximations)
Peak cost, therefore, is ~$2.5m the first year, and ~$200k per year afterwards
And, of course, we'll want to plan for replacement systems to pop-in ... so factor-up to $250k per year in overhead (salary, benefits, taxes, power, budget for additional/replacement servers)
Using [Wasabi](https://wasabi.com/cloud-storage-pricing/#three-info), 10PB is going to run ~$62k/mo, or ~$744k per year
It's cheaper to build-vs-buy in no more than 5 years ... probably under 3
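As a sanity check on that breakeven claim, here is a hedged sketch using only the rough figures quoted above (DIY year-one and ongoing costs, Wasabi at ~$744k/yr); the all-HDD year-one number folds the ~$960k hardware into the same ~$250k/yr overhead.

```python
# Build-vs-buy breakeven sketch; every number is a rough figure from above.
wasabi_per_year = 744_000
scenarios = {
    "all-SSD": (2_500_000, 250_000),  # ~$2.5M peak first year, ~$250k/yr after
    "all-HDD": (1_200_000, 250_000),  # ~$960k hardware + first-year overhead
}
for name, (year_one, per_year) in scenarios.items():
    diy = cloud = 0
    for year in range(1, 7):
        diy += year_one if year == 1 else per_year
        cloud += wasabi_per_year
        if diy <= cloud:
            print(f"{name}: DIY is cheaper by year {year}")
            break
```

That comes out to roughly year 5 for all-SSD and year 2 for all-HDD, which matches the "no more than 5, probably under 3" conclusion above.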
Wasabi and Glacier would be my 2nd choices.
It probably depends on if you are tied at the hip to other AWS services. If you are, then you're kind of stuck. The ingress/egress traffic will kill you doing anything with that data anywhere else.
If you aren't, the major players for on-prem S3 (assuming you want to continue access the data that way) would be (in no specific order):
Cloudian
Scality
NetApp Storagegrid
Hitachi Vantara HCP
Dell/EMC ECS
There are pluses and minuses to all of them. At that capacity I would honestly avoid a roll-your-own unless you're on a shoestring budget. Any of the above will be cheaper than Amazon.
A 1U rack server attached to two JBODs (each 4U, containing 60 spinning disks), connected to the server via 4 SAS HD cables. The rack server gets 512GiB of RAM to cache reads, and an Optane drive as a persistent cache for writes. The usable storage depends on your redundancy and spare needs. But as an example, my setup (9 x 6-drive RAIDz2 vdevs plus 4 hot spares per JBOD) nets me about 450 TiB per JBOD, or 900 TiB per rack server with two JBODs.
Repeat the setup six times and it would meet your 10 PB need. Throw in a few 10Gbps links per server and have them all linked up by a switch, and you've got your own storage setup. Maybe Minio (I have no experience with it) or something like that would give you an S3 interface over the whole thing.
I bet it would come out much cheaper than AWS. But you've got to get your hands dirty a bit with systems work, and automate all the things with a tool like Ansible. Having done it, I'd say it is totally worth it at your scale.
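A quick sanity check of the ~450 TiB-per-JBOD figure above. The drive size isn't stated in the comment, so ~14 TB drives are assumed here, and real ZFS overhead (metadata, slop space) will shave off a bit more.

```python
# RAIDz2 usable-capacity estimate for one 60-bay JBOD (drive size assumed).
vdevs = 9
vdev_width = 6            # 6-wide RAIDz2 = 4 data + 2 parity drives per vdev
parity_per_vdev = 2
hot_spares = 4
drive_tb = 14             # assumption: marketed (decimal) terabytes per drive

data_drives = vdevs * (vdev_width - parity_per_vdev)       # 36 data drives
usable_tib = data_drives * drive_tb * 1e12 / 2**40         # ~458 TiB
drives_used = vdevs * vdev_width + hot_spares              # 58 of 60 bays
print(f"{drives_used} bays used, ~{usable_tib:.0f} TiB usable per JBOD")
```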
We've recently switched to a setup with several Synology boxes for around 1PB net storage.
Is the data cold storage, that is rarely accessed? Is it OK to risk losing a percentage of it? Can you identify that percentage? If it's actively utilized, is it all used, or just a subset? Which subset? How much data is added every day? How much is deleted? What are the I/O patterns?
Etc.
I have direct experience moving big cloud datasets to on-site storage (in my case, RAID arrays), but it was a situation where the data had a long-tail usage pattern, and it didn't really matter if some was lost. YMMV.
If you're looking for a partner/consultant to get things going, feel free to reach out! This stuff is sort of our wheelhouse; my co-founder and I were previously Ops at Imgur, so you can imagine the kinds of image hosting problems we've seen :P
The short story is, ignore most of the advice, poach^H^H^H^H^Hhire someone who has done this, and leverage their expertise. There is no armchair quarterbacking infrastructure at this scale.
An example is the Backblaze Storage Pod 6.0: according to them it holds 0.5PB at a cost of about $10k, so you'd need about 20 x $10k = $200k plus maintenance (they also publish failure rates). The schematics and everything are on their website, and according to them they already have a supplier who builds these devices for them, which you could probably buy from. Note: this was published in 2016; they probably have a Pod 7.0 by now, so the cost may be better.
Reference: https://www.backblaze.com/blog/open-source-data-storage-serv...
If it's the former, then investing in-house might make sense (a la Dropbox's reverse course).
Since we're talking about images and videos, do you already have different quality versions of each media item available? Maybe thumbnail, high quality, and full quality. That could allow you to use cold storage for the full-quality media, serving the high-quality version while waiting for retrieval.
If the use case is more of a backup/restore service and a restore typically takes longer than a cold storage retrieval (being Glacier or self hosted tape robot), then keep just enough in S3 to restore while you wait for the retrieval of the rest.
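For the S3/Glacier variant of that idea, a minimal boto3 sketch (bucket and key names are placeholders): kick off a cheap bulk restore of the cold original while the app keeps serving the warm rendition.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to thaw the archived original; Bulk is the cheapest tier and takes hours.
s3.restore_object(
    Bucket="your-media-bucket",
    Key="originals/video-1234.mp4",
    RestoreRequest={
        "Days": 2,                                 # how long the restored copy stays readable
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)
# Meanwhile, keep serving renditions/video-1234-1080p.mp4 from a warm tier.
```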
If you go the self-hosted route, I like software that is flexible around hardware failures. Something that will rebalance automatically and reduce the total capacity of the cluster, rather than require you to swap the drive ASAP. That way you can batch all the hardware swapping/RMA once per week/month/quarter.
https://www.ebay.com/itm/313012077673
If it's all archival storage then it's pretty straightforward. If you're on GCP you take it all and dump it into archival single-region DRA (Durable Reduced Availability) storage for the lowest costs.
Otherwise, identify your segments and figure out a strategy for "load balancing" between the standard, nearline, coldline, and archive storage classes. If you can figure out a chronological pattern, you can write a small script that uses gsutil's built-in rsync feature to mirror data from a higher-grade storage class to a lower one at the right time.
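Alternatively (and this is just a sketch, not what the commenter described), GCS can demote objects natively with a lifecycle config instead of a scripted rsync. The ages below are placeholders, and the file is applied with `gsutil lifecycle set rules.json gs://your-bucket`.

```python
import json

# Age-based demotion rules; tune the thresholds to your access pattern.
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90}},
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 365}},
    ]
}

with open("rules.json", "w") as f:
    json.dump(lifecycle, f, indent=2)
```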
The strategy will probably be similar with any of the other big 3 providers as well, but fair warning: some providers' archival-grade storage does not have immediate availability, last I checked.
10PB seems like a lot to store in S3 buckets. I assume much of that data is not accessed frequently or would be used in a big data scenario. Maybe some other services like Glacier or RedShift (I think).
Consider looking at Nutanix - you can get the hardware from HPE (including Apollo).
Object storage from Nutanix doesn’t even break a sweat at 10PB of usable storage.
However, the main reasons to look at Nutanix would be ease of use for day 0 (bootstrapping), day 1 (administrative operations, capacity management), fault tolerance, and day n operations (upgrades, security patches, etc).
Nutanix spends considerable time and resources on all of this to make life easy for our customers.
2. As a general-purpose alternative, I would use Backblaze. It's cheap and they know what they're doing. Here is a comparison of (non-personal) cloud vendor storage prices: https://gist.github.com/peterwwillis/83a4636476f01852dc2b670...
3. You need to know how the architecture impacts the storage costs. There are costs for incoming traffic, outgoing traffic, intra-zone traffic, storage, archival, and 'access' (cost per GET, POST, etc). You may end up paying $500K a month just to serve files smaller than 1KB (see the sketch after this list).
4. You need to match up availability and performance requirements against providers' guarantees, and then measure a real-world performance test over a month. Some providers enforce rate limits, with others you might be in a shared pool of rate limits.
5. You need to verify the logistics for backup and restore. For 10PB you're gonna need an option to mail physical drives/tapes. Ensure that process works if you want to keep the data around.
6. Don't become your own storage provider. Unless you have a ton of time and money and engineering talent to waste and don't want to ship a reliable product soon.
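On point 3, a rough sketch of how tiny objects shift the bill from bandwidth to per-request fees. The ~$0.0004 per 1,000 GETs and ~$0.09/GB figures are approximate S3 list prices, so check current pricing before leaning on them.

```python
# Serving a trillion 1 KB objects a month: requests dominate, not bandwidth.
gets_per_month = 1_000_000_000_000
object_size_kb = 1

request_cost = gets_per_month / 1_000 * 0.0004            # ~$400k/month
egress_gb = gets_per_month * object_size_kb / 1_000_000   # ~1 PB out
egress_cost = egress_gb * 0.09                            # ~$90k/month
print(f"requests ~${request_cost:,.0f}/mo, egress ~${egress_cost:,.0f}/mo")
```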
Try being intentional and smart in front of your data pipeline and purge data that is not useful. Too many times people store data "just in case" and that case never happens years later.
While there's definitely a cross-over point where you should roll your own, the overhead costs of running a storage cluster reliably (and all the problems you don't really have to deal with because they're outsourced to AWS) mean it might be a better use of time and effort to see how much you can cut that number down by changing the parameters of your storage. The immediate savings will be much easier to justify.
Keep in mind you've also got a migration problem: getting 10PB off Amazon is not a simple, handsfree project.
> downloaded in the background to a mobile phone
and
> but less than 1000ms required
I'm struggling to think of what kind of application needs data access in the background with latency of less than 1000ms. That would normally be for interactive use of some kind.
Getting to a 1-minute access time would get you into S3 Glacier territory... you will obviously have considered this, but I feel like some really hard scrutiny of requirements could be critical here. With intelligent tiering and smart software you might make a near order-of-magnitude difference in cost and lose almost no user-perceptible functionality.
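A hedged sketch of what that tiering could look like in practice: a lifecycle rule (the bucket name, prefix, and 30-day threshold are assumptions) that demotes originals to a colder class while the app serves smaller derived copies from a warm tier.

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="your-media-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "demote-originals",
            "Status": "Enabled",
            "Filter": {"Prefix": "originals/"},
            # INTELLIGENT_TIERING or STANDARD_IA are gentler first steps than GLACIER.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```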
> The data is all images and videos, and no queries need to be performed on the data.
Okay, this is a good start, but there are some other important factors.
For every PB of data, how much bandwidth is used in a month, and what percentage of the data is actually accessed?
Annoyingly, the services that have the best warm/"cold" storage offerings also tend to be the services that overcharge the most for bandwidth.
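To put rough numbers on that question (the ~$0.09/GB is an approximate big-cloud internet-egress list price, and the 5% monthly access rate is a made-up knob to play with):

```python
# Back-of-envelope bandwidth bill for a 10 PB corpus.
stored_pb = 10
accessed_fraction_per_month = 0.05                  # assume 5% of the data is read out monthly
egress_gb = stored_pb * 1_000_000 * accessed_fraction_per_month
egress_cost = egress_gb * 0.09                      # ~$0.09/GB internet egress
print(f"~{egress_gb / 1e6:.1f} PB out/month -> ~${egress_cost:,.0f}/month in bandwidth alone")
```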
Plus, here we are not talking about 10PB but probably 25PB given redundancy, and probably 100PB and more given the assumption that your company is growing. So a solution that costs slightly less today but whose cost will only do 2x when you do 10x would still be very interesting, IMO. There is a lot to talk about ;)
Kinda disappointed that the file solutions seem more complicated, with nothing as simple to set up as some of the newer databases like CockroachDB or MongoDB are to use. I feel like reinventing the wheel is kinda bad; I'd rather let people who are experts in this field handle this stuff, but I hate the idea of vendor lock-in and being forced to use other people's servers. Self-hosting would be nice, from a single node for testing up to a cluster spanning multiple datacenters. Maybe there's a solution out there; I've done some searching and just seem to go in circles. I saw one system, but if you wanted to add or remove nodes in the future, you couldn't just "drain" a chunk server by moving its data off.
However, you'll get to a point where it's crucial to become profitable. And storing that much data does cost a lot of money using one of the mentioned providers.
So, when you think it's the right time to become "mature", get your own servers up and running using colocation.
What options do you have here (just a quick brainstorm):
1. Set up some servers, put in a lot of hard drives, format them using ZFS, and make them available over NFS on your network
2. Get some storage servers
3. Set up a Ceph cluster
I used to work as a CTO at a hosting company and evaluated all of these options and more. Each of these options comes with pros and cons.
Just one last piece of advice: evaluate your options and get some external help on this. Any of these options has pitfalls, and you need experienced consultants to set up and run such an infrastructure.
All in all, it's an investment that will save you a lot of money and will give you the freedom and flexibility to grow further.
P.S. we ended up setting up a Ceph cluster. We found a partner who specializes in hosting custom infrastructure. That partner is responsible for all the maintenance, so we could focus on the product itself.
If you're not afraid of having a few operations people on staff and running a few racks in multiple data centers, then buy a bunch of drives and servers and install something to expose everything via S3 interface (Ceph, Minio, ...) so none of your tools have to change.
Way cheaper than AWS, and a lot less headache than trying to run it all yourself.
Surprised I didn't see Gluster in this thread already. Maybe it's not for such a big scale?
edit: Wikipedia says GlusterFS can "scale up to several petabytes on commodity hardware".
Wasabi's Reserved Capacity Storage is likely to be the cheapest: https://wasabi.com/rcs/
If you front it with Cloudflare, egress would be close to free given both these companies are part of the Bandwidth Alliance: https://www.cloudflare.com/bandwidth-alliance/
Cloudflare has an images product in closed beta, but that is likely unnecessary and probably expensive for your usecase: https://blog.cloudflare.com/announcing-cloudflare-images-bet...
--
If you're curious still, take a look at Facebook's F4 (generic blob store) and Haystack (for IO bound image workloads) designs: https://archive.is/49GUM
What are your access patterns? You say "no queries need to be performed," but are you accessing via key-value look-ups? Or ranged look-ups?
What do customers do with the pictures? Do customers browse through images and videos?
You mention it's "user generated data" - how many users (order of magnitude)? How often is new data generated? Does the dataset grow, or can you evict older images/videos (so you have a moving window of data through time)?
Besides your immediate needs, what other needs do you anticipate? (Will you need to do ML/Analytics work on the data in the future? Will you want to generate thumbnails from the existing data set?)
What my experience is based on: I was formerly Senior Software Engineer/Principal Engineer for a team that managed reporting tools for internal reporting of Amazon's Retail data. The team I was on provides tools for accessing several years worth of Amazon.com's order/shipment data.
I'd recommend reaching out to some data eng in the various Bigs, they certainly have more clear numbers. Happy to make an intro if you need, feel free to dm me.
That said, if you really think you must, spend effort on good deduping/transcoding (relatively easy with images/video), and consider some far lower-cost storage options than S3, which is pretty pricey no matter what you do. If S3 is a good fit, I hear good things about Wasabi, but haven't used it myself.
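A minimal sketch of the exact-duplicate half of that dedup effort (the paths are illustrative; perceptual near-duplicate detection for images and re-transcoding video are separate, bigger projects):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream the file so multi-GB videos don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Group files by content hash; anything beyond the first copy is reclaimable.
groups = defaultdict(list)
for p in Path("media").rglob("*"):
    if p.is_file():
        groups[sha256_of(p)].append(p)

reclaimable = sum(p.stat().st_size for paths in groups.values() for p in paths[1:])
print(f"exact-duplicate bytes reclaimable: {reclaimable / 1e9:.1f} GB")
```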
If you have the technical ability (non-trivial, you need someone who really understands, disk and system I/O, RAID Controllers, PCI lane optimization, SAN protocols and network performance (not just IP), etc.) and the wherewithal to invest, then putting this on good hardware with something like say, ZFS at your site or a good co-lo will be WAY cheaper and probably offer higher performance than any other option, especially combined with serious deduping. (Look carefully at everything that comes in once and you never have to do it again.) Also, keep in mind that even-numbered RAID levels can make more sense for video streaming, if that's a big part of the mix.
The MAIN thing: Keep in mind that understanding your data flows is way more important than just "designing for scale". And really try to not need so much data in the first place.
(Aside: I was cofounder and chief technologist of one of the first onsite storage service providers - we built a screamer of a storage system that was 3-4x as fast, and scaled 10x larger, than IBM's fastest Shark array, at less than 10% of the cost. The bad news: we were planning to launch the week of 9/11 and, as a self-funded company, ran out of money before the economy came back. The system kicked ass, though.)
For example, at 10PB with every object stored twice (so 20 PB raw storage), you'd need ~90 of their SX293[1] boxes, coming out to around €30k/mo. This doesn't include the time to configure/maintain on your end, but it does cover any costs associated with replacing failed drives.
I’ve done similar setups for cheap video storage & CDN origin systems before, and it’s worked fairly well if you’re cost conscious.
[1] https://www.hetzner.com/dedicated-rootserver/sx293/configura...
* network access - do you have data that will be accessed frequently, and with high traffic? You need to cover this skewed access pattern in your solution.
* data migration from one node to another, etc...
* ability to restore quickly in case of failure.
I would suggest to:
* use some open-source solution on top of the hosted infrastructure (Hetzner or similar is a good choice)
* bring in a seasoned expert to analyze your data usage/storage patterns; maybe there are some other ways to make storage more cost-effective than simply moving out of AWS S3.
Disclaimer: I'm working at a consultancy/partner for a competing cloud.
https://www.backblaze.com/blog/open-source-data-storage-serv...
Their Storage Pod 6.0 can hold up to 480TB per server.
And it already supports the S3 API, as well as HTTP, FUSE, WebDAV, Hadoop, etc.
There should be many existing hardware options that are much cheaper than AWS S3.
Its API is S3-compliant.
Also, I believe they have minimal costs for transferring data from S3 into Wasabi, so the initial setup cost should be lower too.
It should be relatively cheaper than self-hosting too, when you account for the hidden costs that come with self-hosting: managing additional employees, having protocols in place for recovering from faults, expanding the storage as you go, maintaining existing infrastructure, etc.
You can compare the prices with respect to S3 at
Can you afford the up-front costs of the hardware needed to run the solutions you may want to run?
Will those solutions have good enough data locality to be useful to you?
It isn't really useful to have all your data on-site and your operations in the cloud. You've introduced many new layers that can fail.
If you go on-prem, the solution to look at is likely Ceph.
Source: Storage Software Engineer, who has spoken at SNIA SDC. I currently maintain a "small" 1PB ceph cluster at work.
Recommendation: Get someone who knows storage and systems engineering to work with you on the project. Even if you decide not to move, understanding why is the most important part.
- Paying for physical space and facilities
- Paying people to maintain it
- Paying for DRP/BCP
- Paying periodically since it doesn't last forever so it'll need replacements
But if you have to move out of AWS and Azure and GCP aren't options, you can do Ceph and HDDs. Dual copies of files, so you have to lose three drives for any specific file to suffer data loss (and only those files). It does not come with versioning, full IAM-style access control, or web servers for static files (which you get 'for free' with S3).
HDDs don't need to be in servers, they can be in drive racks, connected with SAS or iSCSI to servers. This means you only need a few nodes to control many harddisks.
A more integrated option would be (as suggested) Backblaze pod-style enclosures, or Storinator-type top-loaders (Supermicro has those too). These are generally 4U rack units for 40 to 60 3.5" drives, which again generally comes to about 1PB per 4U. A 48U rack holds 11 units when using side-mounted PDUs, a single top-of-rack switch, and no environmental monitoring in the rack (and no electronic access control - no space!).
This means that for redundancy you'd need 3 racks of 10 units. If availability isn't a problem (1 rack down == entire service down) you can do 1 rack. If availability is important enough that you don't want downtime for maintenance, you need at least 2 racks. Cost will be about $510k per rack. Lifetime is about 5 to 6 years, but you'll have to replace dead drives almost every day at that volume, which means an additional ~2000 drives over the lifespan; perhaps some RAM will fail too, and maybe one or two HBAs, NICs, and a few SFPs. That's about $1,500,000 in spare parts over the life of the hardware, not including the racks themselves, and not including power, cooling, or the physical facilities to house them.
Note: all of the figures above are 'prosumer' class and semi-DIY. There are vendors that will support you partially, but that is an additional cost.
I'm probably repeating myself (and others) here, but unless you happen to already have most of this (say: the people, skills, experience, knowledge, facilities, money upfront and money during its lifecycle), this is a bad idea and 10PB isn't nearly enough to do by yourself 'for cheaper'. You'd have to get into the 100PB or more arena to 'start' with this stuff if you need to get all of those externalities covered as well (unless it happens to be your core business, which from the opening post it doesn't seem to be).
A rough S3 One Zone IA calculation shows a worst-case cost of about $150,000 monthly, but at that rate you can negotiate a lot of cost savings, and with some smart lifecycle configuration you can get it down further. In the do-it-yourself vs. letting-AWS-do-it comparison, that can make AWS roughly half as expensive as that worst case.
Calculation as follows:
DIY: at least 3 racks to match AWS One Zone IA (you'd need 3 racks in 3 different locations, 9 racks in total, to match 3 zones, but we're not doing that, as per your request). That means the initial starting cost is a minimum of $1,530,000, combined with a lifetime cost of at least $1,500,000 over 5 years if we're lucky, so about $606,000 per year, just for the contents of racks that you already have to have.
Adding to this, you'd have some average colocation costs, no matter whether you have an entire room, a private cage, or a shared corridor. That's at least 160U, and in total at least 1400VA per 4U (or roughly 14A at 120V). That amount of power is what a third of a normal rack might use on its own! Roughly, that will boil down to a monthly racking cost of $1,300 per 4U if you use one of those colocation facilities. That's another ~$45k per month, at the very least.
So a no-personnel, colocated setup can be done, but doing all that stuff 'externally' is expensive: about $95,500 every month, with no scalability, no real security, no web services or load balancing, etc.
That means below-par features get you a rough saving of $50k monthly, provided you don't need any personnel and nothing breaks 'more' than usual. And you'd have to not use any other S3 features besides storage. And if you use anything outside of the datacenter where you're located (i.e. if you host an app in AWS EC2, ECS, or a Lambda or something) and you need a reasonable pipe between your storage and the app, that's a couple of K's per month you can add, eating into the perceived savings.
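Consolidating the numbers above into one back-of-envelope comparison (all figures are the rough ones from this comment, so treat the output as order-of-magnitude only):

```python
# DIY colocation vs. S3 One Zone IA list price, per month over five years.
hardware_upfront = 1_530_000   # three racks of ~$510k each
lifetime_spares = 1_500_000    # drives, HBAs, NICs, SFPs over ~5 years
colo_per_month = 45_000        # ~$1,300 per 4U across the deployment
months = 5 * 12

diy_per_month = (hardware_upfront + lifetime_spares) / months + colo_per_month
s3_list_per_month = 150_000    # worst-case S3 One Zone IA figure from above
print(f"DIY ~${diy_per_month:,.0f}/mo vs S3 One Zone IA list ~${s3_list_per_month:,.0f}/mo")
# ~$95,500/mo DIY, i.e. roughly $50k/mo below S3 list price -- before personnel.
```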
If you're storing images and videos directly from the phone, they can be downsampled drastically without losing quality on a viewing device that anyone's likely to have.
It's unlikely that anyone wants to download the full size copy, and if they do, they can wait a few hours for Glacier.
You could expose this to the customer, e.g. offer direct access of originals at 2x or 5x the price. But 99.9% of people will be OK with immediate access to quality images/video and eventual access to the unmodified originals.
I personally would consider S3 Glacier+CloudFront, a member of the Bandwidth Alliance [0] of your choice+Cloudflare, and whoever serves TikTok now.
Hiring staff to build this would make sense at this point, because if your S3 storage cost is really $200,000/month, you can hire 3 good engineers for $450,000/year, which is roughly the cost of just two months of S3 storage.
As an aside, you can often get nice credits for moving off of AWS to Azure or GCP. I recommend the latter.
We store north of 2PB with AWS and have just committed to an agreement that will increase that commitment based on some competitive pricing they've given us.
Give me a shout if you'd like to chat.
As for my own storage, I use 1TB SanDisk SD cards in a Raspberry Pi 2 cluster for write-once (user) data, and 8x64GB 50nm SATA drives from 2011 on a 2x Atom 8-core for data that changes all the time! Xo
People say that content is king, I think that final technology (systems that don't need rewriting ever) is king and content has peaked! ;)
Also, this is a startup, no? A million or so in storage so you need not preoccupy your startup with dealing with failing disks, disk provisioning, colocation costs, etc. etc., not to mention the 11 9s of durability you get with S3; to me it just makes the most sense to do this in the cloud.
Changing that can be very very difficult for not much gain. Plus AWS skills are very easy to recruit for vs Google cloud.
There is an S3-compatible interface, so you may just need to change the access key and region host: https://www.backblaze.com/b2/docs/s3_compatible_api.html
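In boto3 terms, "change the access key and region host" is roughly this (the endpoint follows B2's documented S3-compatible format, but the region, credentials, and bucket here are placeholders):

```python
import boto3

b2 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",  # your B2 region's endpoint
    aws_access_key_id="YOUR_B2_KEY_ID",
    aws_secret_access_key="YOUR_B2_APPLICATION_KEY",
)
# Existing S3-style code keeps working against the new endpoint.
b2.upload_file("photo.jpg", "your-bucket", "originals/photo.jpg")
```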
We ended up buying lots of Supermicro's ultra dense servers [1]. That's a 3U box, containing 24 servers that are interconnected with internal switches (think: 1 box is a self-contained mini cloud). Each server has (cheap config) 1 CPU 4 Xeon cores, 32GB ram, 4TB disk.
Those were bought & hosted in China, and IIRC price tag was around $20k USD per box. That's 96TB per 3U, or >1.2PB and ~$200k per rack. We had a lot of racks in multiple datacenters. These days capacity can be much larger, e.g.: 6TB disk, 144TB per 3U and >1.8PB per rack.
We've tried Ceph, GlusterFS, HDFS, even early versions of Citus, and pretty much everything that existed and was maintained at that time. We eventually settled on Cassandra. It required 2 people to maintain the software, and 1 for the hardware.
Today, I would do the same hardware setup, mainly because I haven't had a single Supermicro component fail on me since I first bought them in the early 2000s. Cassandra would be replaced by FoundationDB. I've been using FoundationDB for a while now, and it just works: zero maintenance, incredible speeds, multi-datacenter replication, etc.
Alternatively, if I needed storage without processing, but with fast access, I'd probably go with Supermicro's 4U 90-bay pods [2]. That'd be 90 x 16TB, 1.4PB in 4U, or ~14PB per rack. And FoundationDB, no doubt.
As a fun aside: back then, we also tried Kinetic Ethernet Attached Storage [3]. Great idea but what a pain in the rear it was. We did however have a very early access device. No idea if it's still in production or not.
[1] https://www.supermicro.com/en/products/system/3U/5038/SYS-50...
[2] https://www.supermicro.com/en/products/system/4U/6048/SSG-60...
[3] https://www.supermicro.com/products/nfo/files/storage/d_SSG-...
For online or nearline storage, you should look at what Backblaze did. Either buy hardware that is similar to what they did (basically disk shelves, you can cram ~100 drives into a 4U chassis) or if you are at that scale you can probably build your own just like they did.
Chances are you don't need all of it. Every company today thinks they need "Big Data" to do their theoretical magic machine learning, but most of them are wrong. Hoarding petabytes of worthless data doesn't make you Facebook.
To be a little less glib, I'd start by auditing how much of that 10PB actually matters to anyone.
(I'm not working there anymore, posting this just to help)
Because you must be able to deal with Ceph quirks.
If you can shard your data over multiple independent stand-alone ZFS boxes, that would be much simpler and more robust. But it might not scale like Ceph.
The only issue is whether or not you have a CDN in front of this data. If you do, then Backblaze might not be much cheaper than S3 -> CloudFront. You'd save storage costs but easily exceed those savings in egress.
I only read about it, but never used it.
It advertises itself as exabyte-scalable and provides S3 and NFS access.
http://web.archive.org/web/20201128103953/https://blog.ampli...
They're basically 100% S3-compatible.
I don't know the details of their pricing, but they're production-grade in the real sense of the word.
I am not affiliated with them in any way, but I interviewed with them a couple of years ago and left with a good impression.
Everyone that wants to make extra money can join
You join with your computer hooked up to internet, a piece of software running in background
You share a % of your hard drive and limit the speed that can be used to upload/download
When someone needs to store 100PB of data ("uploader"), they submit a "contract" on a blockchain - they also set the redundancy rate, meaning how many copies need to be spread out to guarantee consistency of the data as a whole
The "uploader" shares a file - the file is chopped into chunks and each chunk is encrypted with the uploader's private PGP key. The info about the chunks is uploaded to the blockchain and everyone gets a piece. In return, all parties that keep a piece of the uploader's data get paid a small %, either via PayPal or simply in crypto.
I think that would be a cool project, but someone would have to do back-of-napkin number crunching if that would be profitable enough to data hoarders :)
Not worth the risk or why?
(In 1998, in school, I looked up in our math book what would come after mega, giga... 20 years later, just as fresh and useless as on day one ;))
Perhaps it’s a mix of some app pattern changes and leveraging the storage tier options in AWS to reduce your cost.
Escherichia coli, for instance, has a storage density of about 10^19 bits per cubic centimeter. At that density, all the world's current storage needs for a year could be well met by a cube of DNA measuring about one meter on a side.
There are several companies doing it: https://www.scientificamerican.com/article/dna-data-storage-...
https://file.app/ https://docs.filecoin.io/build/powergate/
(Disclosure: I am indirectly connected to filecoin, but interested in genuine answers)
cat >/dev/null, obviously. ;-)
How often you access data is another question.
Feel free to ama on it, I'm a huge fan
If you place a high value on engineering velocity and you already rely on managed services, then I would look to stay in S3. Do the legwork to gather competitive bids (GCS, Azure, maybe one second tier option) and use that in your price negotiation. Negotiation is a skill, so depending on the experience in your team, you may have better or worse results -- but it should be possible to get some traction if you engage in good faith with AWS.
There is a considerable opportunity cost to moving that data to another cloud provider. No matter how well you plan and execute it, you're going to lose some amount of velocity for at least several months. In a worse scenario, you are running two parallel systems for a considerable amount of time and have to pay that overhead cost on your engineering team's productivity. In the worst case scenario, you experience service degradation or even lose customer data. It's quite easy for 2-3 months to turn into 2-3 years when other higher priority requirements appear, and it's also easy for unknowns to pop up and complicate your migration.
With all of that said, if the fully baked cost of migrating to another cloud provider (engineering time + temporary migration services + a period of duplicated costs between services + opportunity cost) is trajectory changing for your business, then it certainly can be done. I feel like GCS is a bit better of a product vs S3, although S3 has managed to iron out some of its legacy cruft in the last few years. Azure is not my cup of tea. I have never seriously considered any other vendors in the space, although there are many.
Your other option is to build it. I've done it several times, people do it every day. You may need someone on the team who either has or can grow the skillset you're going to need: vendor negotiation, capacity planning, hardware qualification, and other operational tasks. You can save a bunch of money, but the opportunity cost can be even greater.
10PB is the equivalent of maybe 1-2 racks of servers in a world where you can easily get 40-50 drive systems with 10-18TB drives (of course, for redundancy you would need more like 2-2.5x, and you need space to grow into so that you're always ahead of your user growth curve). At any rate, my point is that the deployment isn't particularly large, so you aren't going to see good economies of scale. If you expect to be in the 100+PB range in 6-12 months, this could still be the right option.
Personally, I would look to build a service like this in S3 and migrate to on-premise at an inflection point probably 2 orders of magnitude above yours, if the future growth curve dictated it. The migration time and cost will be even more onerous, but the flexibility while finding product/market fit probably countermands the cost overhead.
There is a third option, which is hosted storage where someone else runs the machines for you. Personally I see it as a stop-gap solution on the path to running the machines yourself, and so it's not very exciting. But it is a way to minimize your investment before fully committing.
1. Do you have paying customers already?
2. Can the startup weather large capex? Does opex work better for you?
3. Do you already have staff with sufficient bandwidth to support this, or will you need to hire?
4. What are the access patterns for the data?
5. What is the data growth rate?
6. What is the cost of losing some, or all of this data?
7. What is your expected ROI?
TL;DR - storing and serving up the data is the easy part.
I have no idea how you evaluate the necessity of keeping the data safe, and that plays a huge factor in deciding what's appropriate. Amazon S3 makes it a no-brainer for having your data safe across failure domains. Of course, the same can be done with non-S3 solutions, but someone has to set it all up, test it, and pay for it.
My background in storage is mostly related to working with Ceph and Swift (both OpenStack Swift and SwiftStack) while being employed by various hardware vendors.
Some thoughts on Ceph:
- In my opinion, Ceph is better suited for block storage than object storage. To be fair, it does support object storage with the use of the Rados Gateway (RGW), and RGW does support the S3 API. However, Ceph has a strong consistency model, and in my opinion strong consistency tends to be better suited to block storage. Why is this? For a 10PB cluster (or larger), failures of various types will be the norm (mostly disk failures). What does Ceph do when a disk fails? It goes to work right away to move whatever data was on the failed disk (using its redundant copies/fragments) to a new place. No big deal if it's only a single HDD that's in failed status at any given point in time. What if you have a server, disk controller, or drive shelf fail? You get a whole bunch of data backfilling going on all at once. The other consideration with the strong consistency model is multi-site storage, which is not so good for strong consistency (due to higher latency for inter-site communication).
- Ceph has a ton of knobs, is very feature rich, and is high on complexity (although it has improved). The open-source mechanisms for installing it and the admin tools have experienced (and continue to have) a high rate of churn. Do a quick search on how to install/deploy Ceph and you'll see multiple approaches; same with admin tools. Should you strongly consider Ceph as an option, I would strongly advise you to license and use one of the 3rd-party software suites that (a) take the pain away from install/deploy/admin, and (b) reduce the amount of deep expertise that you would need to keep it running successfully. Examples of these 3rd-party Ceph admin suites are Croit [0] and OSNEXUS [1]. Alternatively, if you like the idea of a Ceph appliance, I would take a close look at SoftIron [2].
Aside from Ceph, it's worth taking a very close look at OpenStack Swift [3][4]. It's only object storage and has been around for about 10 years. It supports the S3 protocol and also has its own Swift protocol. It's open source and it has an eventually consistent data model. Eventually consistent is (IMO) a much better fit for a 10+PB cluster of objects. Why is this? Because failures can be handled with less urgency and at more opportune times. Additionally, an eventually consistent model makes multi-site storage MUCH easier to deal with.
I suggest going further and spending some quality time with the folks at SwiftStack [5]. Object storage is their game and they're very good at it. They can also help with on-prem vs hosted vs hybrid deployments.
Additionally, you would definitely want to use erasure coding (EC) as opposed to full replication. This is easy enough to do with either Swift or Ceph.
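A quick illustration of why that matters at this scale (the 8+3 profile is just an example; both Swift EC policies and Ceph EC profiles let you pick k and m):

```python
# Raw capacity needed for 10 PB usable: 3x replication vs. 8+3 erasure coding.
usable_pb = 10
replicas = 3                 # typical full-replication factor
k, m = 8, 3                  # EC: k data fragments + m parity fragments

replicated_raw = usable_pb * replicas        # 30 PB raw
ec_raw = usable_pb * (k + m) / k             # 13.75 PB raw
print(f"3x replication: {replicated_raw} PB raw, {k}+{m} EC: {ec_raw:.2f} PB raw")
```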
Disclaimers and disclosures: I am not currently (nor have I ever been) employed by any of the companies I mentioned above.
Dell EMC Technical Lead and co-author of these documents:
Dell EMC Ready Architecture for Red Hat Ceph Storage 3.2 - Object Storage Architecture [6]
Dell EMC Ready Architecture for SwiftStack Storage - Object Storage Architecture Guide [7]
Intel co-author of this document: "Accelerating Swift with Intel Cache Acceleration Software" [8]
[0] https://croit.io
[1] https://www.osnexus.com/technology/ceph
[2] https://softiron.com
[3] https://wiki.openstack.org/wiki/Swift
[4] https://github.com/openstack/swift
[5] https://www.swiftstack.com
[6] https://www.delltechnologies.com/resources/en-us/asset/technical-guides-support-information/solutions/red_hat_ceph_storage_v3-2_object_storage_architecture_guide.pdf
[7] https://infohub.delltechnologies.com/section-assets/solution-brief-swiftstack-1
[8] https://www.intel.sg/content/www/xa/en/software/intel-cache-acceleration-software-performance/intel-cache-acceleration-software-performance-accelerating-swift-white-paper.html
If this isn't already something that your company is familiar with, you'll need people who know how to buy, build, test and manage infrastructure across datacentres, including servers and core networking. Understanding platforms like Linux will be critical, as well as monitoring and logging solutions (perhaps like Prometheus and Elastic).
The only solution that I know of which would scale to your requirements would be OpenStack Swift (https://wiki.openstack.org/wiki/Swift). It's explicitly designed as an eventually consistent object store which makes it great for multi-region, and it scales. It is Apache 2.0 licensed, written in Python with a simple REST API (plus support for S3).
The Swift architecture is pretty simple. It has 4 roles (proxy, account, container, and object) which you can mix and match on your nodes and scale independently. The proxy nodes handle all your incoming traffic, like retrieving data from clients and sending it on to the object nodes and vice versa. Proxy nodes can be addressed independently rather than through a load balancer, which is one of the ways Swift is able to scale out so well. You could start with three and go up to dozens across regions, as required.
The object nodes are pretty simple, they are also Linux machines with a bunch of disks each formatted with a simple XFS file system where they read and write data. Whole files are stored on disk but very large files can be sharded automatically and spread across multiple nodes. You can use replication or erasure coding and the data is scrubbed continuously, so if there is a corrupt object it will be replaced automatically.
Data is automatically kept on different nodes to avoid loss for when a node dies, in which case new copies of the data are made automatically from existing nodes. You can also configure regions and zones to help determine the placement of data across the wider cluster. For example, you could say you want at least one copy of an object per datacentre.
I know that many large companies use Swift and I've personally designed and built large clusters of over 100 nodes (with the SwiftStack product) across three datacentres. This gives us three regions (although we mostly use two) and we have a few different DNS entries as entry points into the cluster. For example, we have one at swift.domain.com which resolves to 12 proxy nodes across each region, then others which resolve to proxy nodes in one region only, e.g. swift-dc1.domain.com. This way users can go to a specific region if they want to, or just the wider cluster in general.
We used Linux on commodity hardware, stock 2RU HPE servers with 12 x 12 TB drives (so total cluster size is ~14PB raw), but I'm sure there's a better sweet spot out there. You could also create different node types, higher density or faster disk as required, perhaps even an "archive" tier. NVMe is ideal for the account and container services; the rest can be regular SATA/NL-SAS. You want each drive to be addressed individually, so no multi-disk RAID arrays; however, each of our drives sits on its own single-member RAID-0 array in order to make use of some caching from the RAID controller (so 12 x RAID-0 arrays per object node).
Our cluster nodes connect to Cisco spine and leaf networking and have multiple networks; e.g. the routeable frontend network for accessing the proxy nodes, private cluster network for accessing objects and the replication network for sending objects around the cluster.
Ceph is another open source option, and while I love it as block storage for VMs, I'm not convinced that it's quite the right design for a large, distributed object store. Compared to Swift, the object store seems more of an afterthought and inherits a system designed for blocks. For example, it is synchronous and latency sensitive, so multi-region can be tricky. It could still be worth looking into, though.
Given the size of your data and ongoing costs of keeping it in AWS, it might be worthwhile investing in a small proof of concept with Swift (and perhaps some others). If you can successfully move your data onto your own infrastructure I'm sure you can not only save money but be in better control overall.
I've worked on upstream OpenStack and I'm sure the community would be very welcoming if you wanted to go that way. Swift is also just a really great piece of technology and I love seeing more people using it :-) Feel free to reach out if you want more details or some help, I'll be glad to do what I can.
I am not from Oracle, and I am also running a startup with growing pains. Oracle is a bit late to the cloud game, so they are loading up their customer base now, and the squeezing will come 3-5 years down the road. Maybe you can take advantage of this.