HACKER Q&A
📣 matusgallik008

How to handle user file uploads?


Hey, I work as an SRE for a company where we allow users to upload media files (profile pictures, docs or videos attached to tasks... the usual). We currently just generate an S3 pre-signed URL and let the user upload stuff. Occasionally, file-type limits are set on the upload element.

I don't feel this is safe enough. I also feel we could do better by optimizing images on the BE, or creating thumbnails for videos. But then there is the question of cost on the AWS side.

Anybody have experience with any of this? I imagine having a big team and dedicated services for media processing could work, but what about small teams?

All thoughts/discussions are welcome.


  👤 kevincox Accepted Answer ✓
I would encourage not directly using the user-uploaded images. But uploading directly to S3 is probably fine. I just wouldn't use the raw file.

1. Re-encoding the image is a good idea to make it harder to distribute exploits. For example, imagine the recent WebP vulnerability. A malicious user could upload a compromised image as their profile picture and pwn anyone who saw that image in the app. There is a chance the exploit survives the re-encoding, but it is much less likely, and at the very least it makes your app not the easiest channel for distributing it.

2. It gives you a good place to strip metadata. For example, you should almost certainly be stripping geolocation data. But in general I would recommend stripping everything non-essential.

3. Generating different sizes as you mentioned can be useful.

4. It allows accepting a variety of formats without requiring consumers to support them all, since you just transcode in one place.

I don't know much about the cost on the AWS side, but it seems like you are always at some sort of risk, given that if a user knows the bucket name they can create infinite billable requests. Can you set a size limit on the pre-signed URL? That would be a basic line of defence. But once the URL expires you probably also want to validate the uploaded data and decide whether it conforms to your expectations (and delete it if you aren't interested in preserving the original).
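A minimal sketch of points 1 and 2 in Python, using Pillow (my choice of library, not something kevincox specified): decoding and re-saving gives you a fresh file, and metadata is dropped unless you explicitly copy it across.

    from io import BytesIO
    from PIL import Image

    MAX_PIXELS = 50_000_000  # hypothetical cap against decompression bombs

    def reencode_image(raw: bytes) -> bytes:
        img = Image.open(BytesIO(raw))
        if img.width * img.height > MAX_PIXELS:
            raise ValueError("image too large")
        img = img.convert("RGB")  # normalize the mode; drops alpha
        out = BytesIO()
        # no exif= argument, so EXIF/geolocation does not survive the round trip
        img.save(out, format="JPEG", quality=85)
        return out.getvalue()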


👤 4RealFreedom
Read through the comments and was surprised no one mentioned libvips - https://github.com/libvips/libvips. At my current small company we wanted to allow image uploads and started with ImageMagick, but certain images took too long to process, so we went looking for faster alternatives. It's a great tool with minimal overhead. For video thumbnails we use ffmpeg, which is really heavy, so we off-load video thumbnail generation to a queue. We've had great luck with these tools.
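For anyone curious what that split looks like in practice, here's a rough sketch using the pyvips binding and shelling out to ffmpeg; the sizes, paths and the one-second seek offset are placeholders of mine.

    import subprocess
    import pyvips

    def image_thumbnail(src: str, dst: str, width: int = 320) -> None:
        # libvips streams the decode, which keeps this fast and low-memory
        pyvips.Image.thumbnail(src, width).write_to_file(dst)

    def video_thumbnail(src: str, dst: str) -> None:
        # grab a single frame ~1s in; this is the heavy part we push to a queue worker
        subprocess.run(
            ["ffmpeg", "-ss", "1", "-i", src, "-frames:v", "1",
             "-vf", "scale=320:-1", "-y", dst],
            check=True, timeout=60,
        )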

👤 jfengel
I don't understand the details of cloud stuff, and it took me a heck of a long time to google just what a "pre-signed URL" was. In case anybody else is in the same bucket (ahem):

Users can't upload to your S3 storage because they lack credentials. (It would be dangerous to make it public.) But you can give them access with a specially generated URL, created each time they want to upload. So your server makes a special URL, "signed" with its own authorization. That lets them upload one file, to that specific URL.
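For the curious, generating one of these with boto3 looks roughly like this (bucket and key names are placeholders):

    import boto3

    s3 = boto3.client("s3")

    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-uploads-bucket", "Key": "user-123/avatar.jpg"},
        ExpiresIn=300,  # the URL stops working after 5 minutes
    )
    # Hand `url` to the browser, which PUTs the file body directly to S3.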

(I dunno about anybody else, but I find working with AWS always involves cramming in a lot of security-related concepts that I had never had to think about before. It puts a lot of overhead on simple operations, but presumably is mandatory if you want an application accessible to the world.)


👤 cebert
We allow uploads with pre-signed URLs at my work too. We added a few more constraints. First, we only allow clients to upload files with extensions that are reasonable for our apps. Files uploaded to S3 are quarantined with tags until a Lambda validates that the binary contents appear to match the signature for the given extension/MIME type. Second, we use ClamAV to scan the uploaded files. Once these checks have completed, we generate a thumbnail with a Lambda and then make the file available to users.
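A rough sketch of the quarantine-until-validated part (the tag name and the tiny signature table are mine, not cebert's); a Lambda listening for s3:ObjectCreated events would call something like this with the bucket and key from the event:

    import boto3

    s3 = boto3.client("s3")

    SIGNATURES = {  # illustrative subset of magic bytes
        ".png": b"\x89PNG\r\n\x1a\n",
        ".jpg": b"\xff\xd8\xff",
        ".pdf": b"%PDF-",
    }

    def validate_and_tag(bucket: str, key: str) -> bool:
        # only fetch the first few bytes, not the whole object
        head = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-15")["Body"].read()
        sig = SIGNATURES.get("." + key.rsplit(".", 1)[-1].lower())
        ok = sig is not None and head.startswith(sig)
        s3.put_object_tagging(
            Bucket=bucket, Key=key,
            Tagging={"TagSet": [{"Key": "scan-status",
                                 "Value": "clean" if ok else "rejected"}]},
        )
        return ok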

I'm honestly surprised this isn't a value-added capability offered by AWS S3, because it's such a common need and undifferentiated work.


👤 dividuum
I recently redesigned my stack for validating uploaded files and creating thumbnails from them. My approach is to have different binaries per file type (currently JPEG/PNG images, H.264/H.265 videos and TrueType fonts). Each of them is implemented so that it receives the raw data stream via stdin and then either produces an error message or a raw planar RGBA data stream via stdout. Each validation/thumbnail process first locks itself into a seccomp strict-mode jail before touching any of the untrustworthy data. Seccomp prevents it from making basically any syscall except read/write. Even if there were an exploit in the format parser, it would very likely not get anywhere, as there's literally nothing it could do except write to stdout. Outside the jail, a strict time limit is enforced.

The raw RGBA output is then received and converted back into PNG or similar. It was a bit tricky to get everything working without additional allocations or syscalls triggered by glibc somewhere, but it works pretty well now and is fast enough for my use case (around 20 ms/item).
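The parent side of a pipeline like that might look something like this in Python (dividuum's implementation is presumably not Python, and the converter binary names and timeout are made up). The seccomp strict-mode lockdown happens inside the converter itself before it reads any input; out here you only pipe bytes and enforce the wall-clock limit.

    import subprocess

    def to_rgba(kind: str, raw: bytes, timeout_s: int = 10) -> bytes:
        proc = subprocess.run(
            [f"./convert-{kind}"],  # e.g. ./convert-jpeg, one binary per format
            input=raw,              # untrusted bytes go in via stdin only
            capture_output=True,
            timeout=timeout_s,      # hard time limit enforced outside the jail
        )
        if proc.returncode != 0:
            raise ValueError(proc.stderr.decode(errors="replace"))
        return proc.stdout          # raw planar RGBA, re-encoded elsewhere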


👤 dgoldstein0
Some application security thoughts for serving untrusted content. Not all of these are required, but the main thing is that you don't want the user to be able to serve HTML or similar file formats (PDF, SVG?) that can use your origin and therefore gain access to anything your origin can do:

- serve on a different top-level domain, ideally with random subdomains per uploaded file or per user who provides the content. This is really most important for serving document types and probably not for images, though SVG I think is the exception, as it can have scripting and styling within it when loaded outside of an IMG tag

- set "content-security-policy: sandbox" (don't allow scripts and definitely don't allow same origin)

- set "X-Content-Type-Options: no sniff" - disabling sniffing makes it a lot harder to submit an image that's actually interpreted as html or js later.

Transforming the uploaded file also would defeat most exploit paths that depend on sniffing the content type.
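A sketch of what setting those headers can look like on whatever serves the uploaded files, shown here with Flask purely for illustration:

    from flask import Flask, send_from_directory

    app = Flask(__name__)

    @app.after_request
    def harden_user_content(resp):
        # no scripts and no same-origin access, even if something slips through
        resp.headers["Content-Security-Policy"] = "sandbox"
        # never let the browser second-guess the declared Content-Type
        resp.headers["X-Content-Type-Options"] = "nosniff"
        return resp

    @app.route("/uploads/<path:name>")
    def uploads(name):
        return send_from_directory("/srv/user-uploads", name)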


👤 atonse
We use Cloudflare Images and Cloudflare Stream (Video) to process images and video that are uploaded to our site.

Both have worked well for us so far but I don't know about your scale and impact on pricing (we're small scale so far).

Cloudflare Images lets you auto resize images to generate thumbnails, etc. Same with video, where they will auto-encode the video based on who's watching it where. So for us it's just a matter of uploading it, getting an identifier, and storing that identifier.


👤 jjice
You'll probably have no issue sending the image to your backend directly, doing whatever you want to it (compression, validation, etc.), and then uploading it to S3 from there. It's not a lot of overhead (and I'd argue it's more testable and easier to run locally).

You can do the math on the ingress to your service (let's say it's EC2), and then the upload from EC2 to S3.

It appears that AWS doesn't charge for ingress [0]. "There is no charge for inbound data transfer across all services in all Regions".

S3 is half a cent per 1000 PUT requests [1], but you were already going to pay that anyway. Storage costs would also be paid anyway with a presigned URL.

You'll have more overhead on your server, but how often do people upload files? It'll depend on your application. Generally, I'd lean towards sending it to your backend until that becomes an issue, but I doubt it ever will. Having the ability to run all the post-processing you want and validate the file is a big win. It's also just so much easier to test when everything is right there in your application.
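A bare-bones version of that flow, with the framework, limit and bucket name being my own placeholders: the browser posts the file to your app, you validate (and ideally re-encode) it, and your app does the S3 PUT itself.

    import uuid

    import boto3
    from flask import Flask, abort, request

    app = Flask(__name__)
    s3 = boto3.client("s3")
    MAX_BYTES = 10 * 1024 * 1024  # hypothetical size limit

    @app.route("/upload", methods=["POST"])
    def upload():
        f = request.files.get("file")
        if f is None:
            abort(400)
        data = f.read(MAX_BYTES + 1)
        if len(data) > MAX_BYTES:
            abort(413)
        # ...validate / re-encode / strip metadata here before storing...
        key = f"uploads/{uuid.uuid4()}"  # generate our own key; never trust the client filename
        s3.put_object(Bucket="my-media-bucket", Key=key, Body=data)
        return {"key": key}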

[0] https://aws.amazon.com/blogs/architecture/overview-of-data-t...

[1] https://aws.amazon.com/s3/pricing/


👤 fy20
Lots of other comments give good suggestions on how to handle uploading and processing, but none mention serving the resulting content, so let me chime in:

Do not serve content from S3 directly.

ISPs often deprioritize traffic from S3, so downloading assets can be very slow. I've seen kilobytes per second on a connection that Speedtest.net says has a download speed of 850 Mbit/s. Putting CloudFront in front of S3 solves that.


👤 qingcharles
I remember one major web host in 2004... I noticed they weren't checking the extension of profile pic uploads, so I uploaded a .aspx file into which I had written a file-tree explorer.

From there I could browse through all of their customers' home directories; eventually I found the SQL database admin password, which turned out to be the same as the administrator password for the Windows server it was running on: "internet".

This was a big lesson for me in upload sanitizing.


👤 giaour
> Occasionally, file-type limits are set on the upload element.

Since this isn't enforced by the presigned PUT URL, you can't trust that the limits have been respected without inspecting whatever was uploaded to S3. You can get a lot more flexibility in what is allowed if you use an S3 presigned POST [0], which lets you set minimum and maximum allowed content lengths.
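With boto3 that looks roughly like this (bucket, key and limits are placeholders); the content-length-range condition is enforced by S3 itself at upload time:

    import boto3

    s3 = boto3.client("s3")

    post = s3.generate_presigned_post(
        Bucket="my-uploads-bucket",
        Key="user-123/avatar.jpg",
        Fields={"Content-Type": "image/jpeg"},
        Conditions=[
            ["content-length-range", 1, 5 * 1024 * 1024],  # 1 byte .. 5 MB
            {"Content-Type": "image/jpeg"},                # must match exactly
        ],
        ExpiresIn=300,
    )
    # post["url"] plus post["fields"] become a multipart/form-data POST from the browser.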

[0]: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPO...


👤 p2hari
Look at https://uppy.io/ - open source with lots of integrations. You can keep moving to different levels of abstraction as required and see some good practices for how things are done.

👤 JohnCClarke
What do you do with the uploaded images? You could be exposed to risks that may not be immediately obvious.

I have seen a team struggle for over a month to eliminate NSFW content - avatars - uploaded by a user, which led to their site being demonetised.


👤 pier25
For images we simply use Cloudflare Images which takes care of everything.

Images are easy to display but for other media files you will probably need some streaming solution, a player, etc.

We're in the audio hosting and processing space. We still don't have an API though.

For video maybe look into Wistia, Cloudflare, BunnyCdn, etc.


👤 PaywallBuster
2 buckets:

- upload bucket

- processed bucket

The upload bucket has an event notification on new file uploads that triggers a Lambda; the Lambda re-encodes, does whatever else you deem fit, and uploads the result to the processed bucket.

Your app then uses the processed bucket.
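A sketch of what that Lambda can look like (bucket names and the re-encode step are placeholders; wire the upload bucket's s3:ObjectCreated:* notification to this function):

    import urllib.parse

    import boto3

    s3 = boto3.client("s3")
    PROCESSED_BUCKET = "my-processed-bucket"

    def handler(event, context):
        for record in event["Records"]:
            src_bucket = record["s3"]["bucket"]["name"]
            # object keys arrive URL-encoded in S3 events
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=src_bucket, Key=key)["Body"].read()
            safe = reencode(body)  # your re-encode / strip / thumbnail step
            s3.put_object(Bucket=PROCESSED_BUCKET, Key=key, Body=safe)

    def reencode(data: bytes) -> bytes:
        # stand-in only; see the Pillow/libvips examples elsewhere in the thread
        return data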


👤 eastoeast
Following a similar stack, has anyone found success handling iPhone HDR and Live Photos? Both seem to cause issues with standard HTML formats. I believe we're using an AWS service to convert videos to various qualities (maybe Elastic Transcoder or MediaConvert), and those iPhone video formats cause the service to error out.

👤 andersa
> But then there is the question of cost on the AWS side.

Use Cloudflare R2 for storage, public bucket for delivery, and a $10/month Hetzner server to run conversions. You don't have to take your users' money just to burn it on AWS...

If you are using AWS and are worried about cost, the first step is to not use AWS.


👤 time0ut
We map the TUS [0] protocol to S3 multipart upload operations. This lets us obscure the S3 bucket from the client and authorize each interaction. The TUS operations are handled by a dedicated micro-service, but it could be done in a Lambda or anything.

Once the upload completes we kick off a workflow to virus scan, unzip, decrypt, and process the file depending on what it is. We do some preliminary checks in the service looking at the file name, extension, magic bytes, that sort of stuff and reject anything that is obviously wrong.

For virus scanning, we started with ClamAV[1], but eventually bought a Trend Micro product[2] for reasons that may not apply to you. It is serverless based on SQS, Lambda, and SNS. Works fine.

Once scanned, we do a number of things. For images that you are going to serve back out, you for sure want to re-encode those and strip metadata. I haven't worked directly on that part in years, but my prototype used ImageMagick[3] to do this. I remember being annoyed with a Java binding for it.

[0] https://tus.io/ [1] https://www.clamav.net/ [2] https://cloudone.trendmicro.com/ [3] https://imagemagick.org/index.php


👤 arcza
Kinda incredible how nearly every comment here mentions S3. Cloud storage is not the only backend in existence :)

👤 time4tea
A lot of people here are telling you to do stuff, but not really explaining "why".

A different suggestion would be to build a threat model.

Who are your uploaders? Who are your viewers? What are the threats?

Only when you've figured out what the threats are, can the solutions make sense.

Internal site? Internal users? Publicly available upload link? What media types? ....

Threat modelling is a thing that adds lots of value (imho) when thinking about adding new capabilities to a system.


👤 samslade
Hi! I've been working on this exact problem myself recently and decided to build a product out of it. Take a look: www.bucketscan.com

The intention is to develop an API-driven approach that can be easily integrated into your own products as part of your file upload mechanism. We're really early stage, so we're looking for businesses and individuals to help us define what the product should look like. If you'd be up for sharing your thoughts, you can email us at info@bucketscan.com and/or complete this short product survey:

https://forms.gle/rywgnQ7zqsPuLdMd6


👤 tylergetsay
I have used https://github.com/cshum/imagor in front of S3 before and liked it; there are many (some commercial) offerings for this.

👤 nickjj
Beyond taking advantage of validations that are enforced with IAM policies, you can also have a background job handle making thumbnails or whatever you want.

Also, I don't think the Content-Type is actually verified by S3, so technically users can still upload malicious files such as an executable with a .png extension.

On the bright side, S3 supports requesting a range of bytes. You can use that to perform validation server-side afterwards to enforce that it's really a PNG, JPG or whatever format you want. Here are examples in Python and Ruby that verify common image types by reading bytes: https://nickjanetakis.com/blog/validate-file-types-by-readin...


👤 grishka
In my fediverse server project[1], I convert all user-uploaded images to high-quality webp and store them like that. I discard the original files after that's done. I use imgproxy[2] to further resize and convert them on the fly for actual display. In general, I try my best to treat the original user-uploaded files like they're radioactive, getting rid of them as soon as possible.

I don't do videos yet, but I'm kinda terrified of the idea of putting user-uploaded files through ffmpeg if/when I support them.

[1] https://github.com/grishka/Smithereen

[2] https://github.com/imgproxy/imgproxy


👤 santiagobasulto
It's amazing that this was an issue in 2004 and it's still an issue today. I don't have much to add aside from what was already said. There are services like Uppy, Transloadit, etc. that simplify this, but they might be more expensive than S3+CF.

👤 ben_jones
Assuming you serve that content out through a CDN, a lot of the optimization work will be handled there, and customization should also be handled there. I'd be shocked if CDNs don't allow you to do much/all of that out of the box.

Honestly though, if this is an authenticated function and you have a small user base... who cares? Is there a reasonable chance of this disrupting any end-user services? Maybe it's not the best way to spend hundreds of hours and thousands of dollars.

Granted, you're an SRE, so it's your job to think about this. I'd just push back on defaulting to dropping serious resources on a process that might be entirely superfluous for your use case.


👤 erhaetherth
I tried to do the pre-signed URL thing but gave up quickly. I don't know how you'd do it properly. You're going to want a record of the upload in your database, right? So what, you have the client upload the image and then send a second request to your server to tell you they uploaded it?

I ended up piping it through my server. I can limit file size, authenticate, insert it into my DB and whatnot this way.


👤 sigil
Like you, we use pre-signed S3 upload urls. From there we use Transloadit [0] to crop and sanitize and convert and generate thumbnails. Transloadit is basically ImageMagick-as-a-Service. Running ImageMagick yourself on a huge variety of untrusted user input would be terrifying.

[0] https://transloadit.com/


👤 sakopov
This is how I would build it in AWS. Upload to S3 via pre-signed URLs. Create a notification on the bucket which publishes new objects to an SNS topic. Then create a Lambda function with its own dedicated SQS queue subscribing to that SNS topic. This setup lets you post-process new uploads without data loss (and at scale), especially if you use a DLQ. The Lambda would drop all post-processed images into another bucket, from which you can serve them safely via a CloudFront distribution.

Beware: if you have a high-traffic site this will probably cost you an arm and a leg in S3 costs and Lambda executions. You could consider aging out items in the upload bucket to save some money. Similarly, you could use an auto-scaled ECS task instead of Lambda, which would scale out (and in) based on the number of items in the queue. Not for the faint of heart, I know :))

👤 junto
You should quarantine them until you’ve analyzed them.

Like you stated, an async process using a function would suffice. I've previously used ClamAV for this in a private cloud solution; I've also used the built-in anti-virus support on Azure Blob Storage if you don't mind multi-cloud. Plus, Azure Functions support blob triggers, which is a nice feature.

The file-type scan is relatively simple. You just need a list of known "magic string" header values to compare against, and for that you only need at most around 40 bytes from the beginning of the file to do the check (from memory). Depending on your stack, there are usually libraries already available to perform the matching.

And it goes without saying, but never trust the client, and always generate your own filenames.

https://en.m.wikipedia.org/wiki/List_of_file_signatures


👤 bagels
If you are a big enough target, people will try to compromise your infrastructure or your users through these uploads.

Some problems you can run into: exploiting image or video processing tools with crafted inputs, leading to server compromise when encoding/resizing. Having you host illegal or objectionable material. Causing extreme resource consumption, especially around video processing and storage. Having you host material that in some way compromises your clients (exploits for bugs in JPEG libraries, cross-site scripting, etc.).

I can't really talk about what is done at the FAANG that I worked at on this stuff, but if you are a large enough target, this is a huge threat vector.


👤 mattpavelle
If you're looking for a good image optimization product, I've had excellent results from ImageOptim (I have no affiliation with them). They have a free Mac app and an API service, and they also kindly link to lots of other similar free products and services: https://imageoptim.com/versions.html

If you can spare the CPU cycles (depending on how you optimize, they can actually be expensive) and if your images will be downloaded frequently, your users will thank you.


👤 taitems
Ensure you have adequate cleanup procedures too. I heard from ex-employees of a major car reselling platform that CSAM was distributed by creating a draft car ad and never publishing it; the CSAM was then hosted and accessible via direct image URLs. The records were orphaned; not sure how they got the tip-off.

👤 sim7c00
As pointed out already, Magika is useful for good file-type analysis. Also, scan thoroughly for viruses, potentially using multiple engines. This can be tricky depending on the confidentiality of the uploaded files, so take good care not to submit them to sandboxes, AV platforms etc. if that's not allowed. I would really recommend it though.

If you want to get into the nitty-gritty of file-type abuse, to learn how to detect it well: Ange Albertini's work on polymorphic file types and the ezine PoC||GTFO, as well as lots of articles by AV devs, are available. It's a really hard problem and it also depends a lot on what program is interpreting the submitted files. If it's some custom tool, there might even be unique vectors to take into account. Fuzzing and pentesting the upload form and any tools handling these files can potentially shed light on those issues.

(edit: fat fingers)


👤 moomoo11
I use lambda for image uploads.

The function ensures images aren't bigger than 10 MB, compresses the image to our sizing, and puts it into S3.


👤 hamandcheese
For what it's worth, processing the files is probably more risky for your internal infra than doing nothing. I've seen an RCE exploit from resizing profile images before.

On the other hand, not processing/scanning your uploads is probably more risky for your users/the rest of the internet.


👤 stevoski
My SaaS uses Cloudinary for uploading and storing images.

It’s not particularly cheap. But it is fast and flexible and safe.


👤 jftuga
Slight OT...

I created a program for profile pictures. It uses face recognition so as not to deform faces when resizing photos. This may be useful to you.

https://github.com/jftuga/photo_id_resizer


👤 b-karl
I was part of designing a user file upload flow. It was a B2B product with a limited number of, in principle, trusted users, but similar to other comments we did something like:

- some file type and size checks in web app

- pre-signed URL

- upload bucket

- lambdas for processing and sanity checks

- processed bucket


👤 tempaccount1234
It depends on how small. I'm working with really small files (less than 5 MB per upload) and use a FastAPI endpoint on my API to receive the file and then have other Python code save it (or reject it).

👤 efxhoy
We run an image scaler on AWS Lambda based on libvips. We cache the responses from it with Cloudflare. We compared this to letting Cloudflare handle the scaling, and the Lambda was several times cheaper.

👤 ytch
https://github.com/google/magika

    Magika is a novel AI powered file type detection tool that relies on the recent advance of deep learning to provide accurate detection. 
It's not a silver bullet, but I've been using it recently for inspecting file types instead of libmagic/file.

One advantage is that it detects composite files. Take a PDF+EXE file for example: the library will report something like 70% pdf and 30% pebin.


👤 marcpaq
Have you considered a commercial solution?

https://developer.massive.io/js-uploader/

Just point it to your local files, as big as you want, and it does the rest. It handles the complexities of S3 juggling and browser constraints. Your users pay nothing to upload, you pay for egress.

Full disclosure: I'm MASV's developer advocate.


👤 j45
I'm not affiliated, but a cloud file service from someone like Backblaze may interest you.

👤 brudgers
> I don't feel this is safe enough. I also feel we could do better...

What is the business case for making the necessary changes?

Good luck.


👤 TacticalCoder
> I don't feel this is safe enough. I also feel we could do better by optimizing images on the BE, or creating thumbnails for videos.

Yeah, definitely. Even optimizing the vids. I just spent time writing scripts to convert, in parallel, a massive amount of JPG, PNG, PDF, MP4 and even some HEIC files that customers sent of their ID (identity card or passport, basically). I shrank them all to a reasonable size.

The issue is: if you let users do anything, you'll have that one user, once in a while, who will send a 30 MB JPG of his ID. Recto. Then verso.

Then the signed contracts: imagine a user printing a 15-page contract, signing/initialing every single page, then not scanning it but taking a 30 MB picture of each page with his phone, at an angle, in perspective. And sending all the files individually.

After a decade, this represented a massive amount of data.

It was beautiful to crush that data to anywhere from 1/4th to 1/10th of its size and see all the cores working at full speed, compressing everything to reasonable sizes.

Many sites and third-party identity verification services (whatever these are called) do put limits on the allowed size per document, which already helps.

In my case I simply used ImageMagick (mogrify), ffmpeg (to convert to x265) and... Ghostscript (the good old gs command). The PDFs didn't have to be searchable for text, so there's that too (and often they already weren't, at least not easily, due to users taking pictures and then creating a PDF out of the picture).

This was not in Amazon S3 but basically all in Google Workspace: it was for an SME, to make everything leaner, snappier, quicker, smaller. Cheaper too (no need to buy additional storage).

Backups of all the original, full-size files were of course made too, but those will probably never be needed.

In my case I downloaded everything. Both to create backups (offsite, offline) and to crush everything locally (simply on an AMD 7700X: powerful enough as long as you don't have months of videos to encode).

> Anybody have experience with any of this? I imagine having a big team and dedicated services for media processing could work, but what about small teams?

I did it as a one-person job. Putting limits in place, or automatically resizing a 30 MB JPG file which you know is of an ID card down to a 3 MB JPG right after upload, doesn't require a team.

Same for invoking the following to downsize vids:

    ffmpeg -i input.mp4 -vcodec libx265 -crf 28 output.mp4    (I think that's what I'm using)
My scripts' logic was quite simple: files above a certain size were candidates for downsizing; after downsizing, if the output was successful and took less than a certain amount of time, use that, otherwise keep the original.
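A toy version of that logic (the threshold, time budget, resize settings and the use of ImageMagick's convert are mine; the parent used mogrify/ffmpeg/gs with their own settings): only touch big files, give the conversion a time budget, and fall back to the original if anything goes wrong.

    import os
    import subprocess

    SIZE_THRESHOLD = 5 * 1024 * 1024  # only bother with files above ~5 MB
    TIME_BUDGET_S = 120               # past this, give up and keep the original

    def shrink_jpeg(path: str) -> None:
        if os.path.getsize(path) < SIZE_THRESHOLD:
            return
        tmp = path + ".shrunk.jpg"
        try:
            subprocess.run(
                ["convert", path, "-resize", "2000x2000>", "-quality", "80", tmp],
                check=True, timeout=TIME_BUDGET_S,
            )
            os.replace(tmp, path)  # success within budget: keep the smaller file
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if os.path.exists(tmp):
                os.remove(tmp)     # conversion failed or was too slow: keep the original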

I didn't bother verifying that the files visually matched (once again: all the originals are available on offline backups in case something went south and some file is really badly needed) but I could have done that too. There was a blog post posted here a few years ago where a book author would visually compare thumbnails of different revisions of his book, to make sure that nothing changed too much between two minor revisions. I considered doing something similar but didn't bother.

Needless to say my client is very happy with the results and the savings.

YMMV but worked for me and worked for my client.


👤 squigz
Is HN turning into StackOverflow now?

👤 whoknowsidont
Is this really the state of the industry? Where an SRE is asking how to handle user media on the web?

I'm not diminishing asking the question in principle; I'm questioning the role, and the forum, that the question is being asked on.