Streams are the most underrated feature in Node.js
Streams use much less memory than buffers, and memory was the constraint I was trying to mitigate. You can extract a 10 GB zip file in an AWS Lambda with 128 MB of max memory. Here is the experiment: pavelloz/streams-vs-buffers
Server and background job
Servers are expensive, and if you are building a web application you want them to serve requests. If they are inefficient or overloaded, it has a negative impact on your users and visitors, which translates into their dissatisfaction. I’m always on the lookout for things that can be extracted from the monolith app to an external service, e.g. AWS Lambda.
The default synchronous flow looked like this:
- User uploaded a zip file to the application server (blocking the request pool shared with web visitors)
- Background job extracted the files into the /tmp directory
- HTML/Liquid/YML files were used to generate the database and HTML pages
- Assets were sent to an S3 bucket
Unzipping is isolated enough to move to AWS Lambda. The code was pretty simple, and after one day of hacking and two days of struggling with poor AWS Lambda debugging I had a working prototype.
The first iteration of the Lambda flow looked like this:
- User uploaded two zip files (asynchronous):
  - A small zip file to the server: only text files with database schemas and HTML files, usually below 1 MB
  - A bigger zip with assets to the S3 bucket: usually below 50 MB, with edge cases reaching >200 MB
- An S3 event (PutObject) was fired every time a zip file was uploaded, telling the Lambda to extract the files and remove the original zip when it was done
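As a rough illustration of the last step, here is a minimal sketch of how a Lambda could work out which zip files an S3 event refers to. The helper name `zipObjectsFromEvent` and the sample payload are made up for this example, and the production Lambda is not shown in this post, so treat it as an assumption about the shape rather than the real code. It relies only on the documented S3 notification fields (`Records[].s3.bucket.name`, `Records[].s3.object.key`).

```javascript
// Hypothetical helper: list the zip objects referenced by an S3 event notification.
function zipObjectsFromEvent(event) {
  return (event.Records || [])
    .map((record) => ({
      bucket: record.s3.bucket.name,
      // S3 URL-encodes object keys in event payloads and uses "+" for spaces.
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
      size: record.s3.object.size,
    }))
    .filter((object) => object.key.toLowerCase().endsWith('.zip'));
}

// Sample payload trimmed down to the fields used above (values are invented).
const sampleEvent = {
  Records: [
    { s3: { bucket: { name: 'example-bucket' }, object: { key: 'deploys/my+assets.zip', size: 1024 } } },
    { s3: { bucket: { name: 'example-bucket' }, object: { key: 'deploys/readme.txt', size: 10 } } },
  ],
};

console.log(zipObjectsFromEvent(sampleEvent));
```

The handler itself would then download, extract and delete each listed object; that part depends on the AWS SDK and is left out here.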
This version of the Lambda used a Buffer, so the limit was at most the memory allocated to the function. Unfortunately, the more memory you assign, the more expensive the function is to run. Memory and CPU on Lambda are tied together.
Even though AWS Lambda at that time had 3 GB of memory, it was not enough. Text files (e.g. a big CSV file or a lot of HTML files) are highly compressible with zip, and once extracted they are sometimes 100x bigger than the zip file.
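To get a feel for how well repetitive text compresses, here is a small sketch that builds a CSV in memory and compares its raw size with the compressed size. It uses gzip from Node's built-in `zlib` module instead of zip (both normally use DEFLATE), and the rows are invented, so the ratio it prints is only indicative.

```javascript
const zlib = require('node:zlib');

// Build a repetitive CSV in memory (made-up rows, for illustration only).
function buildCsv(rows) {
  const lines = ['id,name,phone'];
  for (let i = 0; i < rows; i++) {
    lines.push(`${i},John Doe,555-0100`);
  }
  return Buffer.from(lines.join('\n'));
}

// Ratio between the raw size and the gzip-compressed size.
function compressionRatio(buffer) {
  const compressed = zlib.gzipSync(buffer);
  return buffer.length / compressed.length;
}

const csv = buildCsv(100000);
console.log(`raw: ${csv.length} bytes, ratio: ${compressionRatio(csv).toFixed(1)}x`);
```

Real data is less uniform than this, but the point stands: a small archive can expand into something far larger than the memory the function was given.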
One day the worst case scenario happened: one of our users reported that deploy was not working for him. After a quick investigation it turned out that the Lambda responsible for unzipping was throwing an Out of memory error. It had to be dealt with somehow, even if it was an edge case. That particular user did push the limits of deploying by sending a lot of images, videos and PDFs, but since I already knew about the existence of streams, I decided to try to fix it.
I went to the drawing board and did some research on streams. I couldn’t believe all the good words in blog posts, so I prepared my own data and code to test how much memory is used with buffers and with streams.
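One simple way to take this kind of measurement is to sample `process.memoryUsage().rss` around a task. The sketch below is not the code from the repository; `measure` is a hypothetical helper, and RSS deltas are only a rough approximation of what a task really uses.

```javascript
// Hypothetical helper: run an async task and report approximate RSS growth and duration.
async function measure(label, task) {
  const startRss = process.memoryUsage().rss;
  const startTime = process.hrtime.bigint();
  let peakRss = startRss;
  // Sample memory periodically, because usage can drop again before the task ends.
  const timer = setInterval(() => {
    peakRss = Math.max(peakRss, process.memoryUsage().rss);
  }, 10);
  try {
    await task();
  } finally {
    clearInterval(timer);
  }
  peakRss = Math.max(peakRss, process.memoryUsage().rss);
  const elapsedMs = Number(process.hrtime.bigint() - startTime) / 1e6;
  const usedMb = (peakRss - startRss) / 1024 / 1024;
  console.log(`${label}: memory used ~${usedMb.toFixed(0)} MB, completed after ${elapsedMs.toFixed(2)} ms`);
  return { usedMb, elapsedMs };
}

// Usage: a 50 MB buffer stands in for "read the whole file into memory".
measure('buffer', async () => {
  const big = Buffer.alloc(50 * 1024 * 1024, 1);
  return big.length;
});
```

Synchronous work blocks the sampling timer, so for fully synchronous tasks only the reading taken right after the task counts.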
I had to rewrite the Lambda, but the code was so ugly that I never open sourced it. Hopefully one day I will have time to get back to it. The more important thing was: it worked.
I prepared a script that generates big CSV files using Faker.js. By changing one variable I can decide how many rows it will generate. I chose CSV files for the tests because they compress very well, so I didn’t have to generate a very big zip file to simulate the memory usage of uncompressing.
To make the experiment more “real-life like”, it also removes some fields and replaces some text (formatting a phone number).
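Here is a minimal sketch of that kind of row processing using only Node's built-in streams (the experiment itself uses highland, so this is a stand-in, not the repository code). The column layout `id,name,email,phone`, the helper names and the phone format are assumptions made for the example, and the comma splitting is naive: it does not handle quoted fields.

```javascript
const { Readable, Transform, Writable } = require('node:stream');
const { pipeline } = require('node:stream/promises');
const { StringDecoder } = require('node:string_decoder');

// Splits incoming chunks into lines, keeping a partial last line for the next chunk.
function splitLines() {
  const decoder = new StringDecoder('utf8');
  let rest = '';
  return new Transform({
    readableObjectMode: true,
    transform(chunk, _encoding, callback) {
      const lines = (rest + decoder.write(chunk)).split('\n');
      rest = lines.pop();
      for (const line of lines) this.push(line);
      callback();
    },
    flush(callback) {
      rest += decoder.end();
      if (rest) this.push(rest);
      callback();
    },
  });
}

// Assumed columns: id,name,email,phone. Drops email and reformats 10-digit phone numbers.
function cleanRows() {
  return new Transform({
    objectMode: true,
    transform(line, _encoding, callback) {
      const [id, name, , phone = ''] = line.split(',');
      const digits = phone.replace(/\D/g, '');
      const formatted = digits.length === 10
        ? `${digits.slice(0, 3)}-${digits.slice(3, 6)}-${digits.slice(6)}`
        : phone; // leave headers and unexpected values untouched
      callback(null, `${id},${name},${formatted}\n`);
    },
  });
}

// Runs the chunks through the pipeline and collects the output as one string.
async function run(chunks) {
  const output = [];
  const collect = new Writable({
    write(chunk, _encoding, callback) {
      output.push(chunk.toString());
      callback();
    },
  });
  await pipeline(Readable.from(chunks), splitLines(), cleanRows(), collect);
  return output.join('');
}

run(['id,name,email,phone\n1,Ann,ann@exa', 'mple.com,(555) 123-4567\n']).then(console.log);
```

Each row is handled as soon as its line is complete, so only the current chunk and a partial line are held in memory at any time.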
To handle stream processing I used the excellent highland library. In the production code it was not needed, because AWS S3 supports streaming.
If you want to reproduce the test on your machine:
- Clone the repository:
git clone git@github.com:pavelloz/streams-vs-buffers.git
- Install npm dependencies:
npm install
- Generate data:
npm run generate-data
- Run experiment:
npm run experiment
Here are the results for 500000 rows on my local machine. The input file was 340 MB.
Buffer:
Memory used: 1403 MB
Operation completed after: 1370.11 ms
237M outputBuffer.csv
Streams:
Memory used: 5 MB
Operation completed after: 1446.20 ms
236M outputStreams.csv
1403 MB vs 5 MB is a pretty big difference. And the best thing is, streams won’t take much more than that no matter how big the file is, while the memory usage of buffers grows with the size of the files.
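The difference between the two approaches can be sketched with a plain decompression task. Node has no zip support in core, so this example uses gzip from the built-in `zlib` module as a stand-in; the function names and file paths are made up, and it is not the code from the experiment.

```javascript
const fs = require('node:fs');
const zlib = require('node:zlib');
const { pipeline } = require('node:stream/promises');

// Buffer approach: the whole compressed and decompressed file sit in memory at once.
async function gunzipWithBuffer(source, target) {
  const compressed = await fs.promises.readFile(source);
  const decompressed = zlib.gunzipSync(compressed);
  await fs.promises.writeFile(target, decompressed);
}

// Stream approach: data flows through in small chunks, so memory stays roughly flat.
async function gunzipWithStreams(source, target) {
  await pipeline(
    fs.createReadStream(source),
    zlib.createGunzip(),
    fs.createWriteStream(target)
  );
}
```

Both functions produce the same output file. `pipeline` also takes care of backpressure and of closing the streams if any step fails, which is the main reason to prefer it over chaining `.pipe()` calls by hand.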
I’m not entirely sure why the sizes of the generated CSV files are not exactly the same. If you want to find the bug, I will be happy to accept a PR or a comment :)
Streams performed worse when it comes to execution time. I was not worried about that, because a slower deploy is much better than no deploy at all. Additionally, the difference was not substantial. The low memory footprint allowed me to scale the Lambda down to 256 MB of memory, and it is still plenty fast for the 99th percentile of cases. The edge cases have to wait a little longer, but at least it’s not throwing an error anymore, which is a success in my book.
I highly encourage you to experiment with streams, even if the first time you use them turns out to be your last. They are different from other concepts I have come across in programming, and when they finally clicked, the satisfaction was one of a kind.