Streams are the most underrated feature in Node.js

TLDR

Streams use much less memory than buffers, and memory was the constraint I was trying to work around. You can extract a 10 GB zip file in an AWS Lambda with a maximum of 128 MB of memory. Here is the experiment: pavelloz/streams-vs-buffers

Server and background job

Servers are expensive, and if you are building a web application you want them to serve requests. If they are inefficient or overloaded, your users and visitors feel it, and that translates into dissatisfaction. I’m always on the lookout for things that can be extracted from the monolith app into an external service, e.g. AWS Lambda.

The default synchronous flow looked like this:

User uploaded a zip file to the application server (blocking the request pool shared with web visitors)
Background job extracted the files into the /tmp directory
HTML/Liquid/YML files were used to generate the database and HTML pages
Assets were sent to an S3 bucket

Unzipping is isolated enough to be extracted to AWS Lambda. The code was pretty simple, and after one day of hacking and two days of struggling with poor AWS Lambda debugging I had a working prototype.

The first iteration of the Lambda flow looked like this:

User uploaded two zip files (asynchronously)
- Small zip file to the server: only text files with database schemas and HTML files, usually below 1 MB
- Bigger zip with assets to an S3 bucket: usually below 50 MB, with edge cases reaching over 200 MB
S3 event (PutObject) was fired every time a zip file was uploaded, telling Lambda to extract the files and remove the original zip when done
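For reference, an S3 notification carries the bucket name and the object key inside a `Records` array, and the key arrives URL-encoded. Below is a minimal sketch of how a handler might pull them out. The helper name `extractS3Objects` and the sample bucket and key are invented for illustration; this is not the original Lambda's code.

```javascript
// Hypothetical helper: list the objects referenced by an S3 event notification.
function extractS3Objects(event) {
  return (event.Records || []).map((record) => ({
    bucket: record.s3.bucket.name,
    // Keys arrive URL-encoded, with spaces turned into "+".
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
  }));
}

// Example event, trimmed to the fields used above (values are invented).
const sampleEvent = {
  Records: [
    { s3: { bucket: { name: 'example-uploads' }, object: { key: 'sites/my+assets.zip' } } },
  ],
};

console.log(extractS3Objects(sampleEvent));
```

A real handler would then fetch each object and extract it; that part depends on the AWS SDK version in use and is left out here.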

This version of the Lambda used a Buffer, so it could never handle more data than the memory allocated to the function. Unfortunately, the more memory you assign, the more expensive the function is to run. Memory and CPU on Lambda are tied together ^[1].
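To make the limit concrete, here is a minimal sketch of a buffered approach (not the original Lambda code). The whole input sits in memory as one Buffer, and then again as strings, before anything is written. The function name `processBuffered` and the temporary file paths are placeholders.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Buffered approach: read everything, transform everything, write everything.
// Peak memory grows with the size of the input file.
function processBuffered(inputPath, outputPath, mapLine) {
  const whole = fs.readFileSync(inputPath);          // entire file as one Buffer
  const lines = whole.toString('utf8').split('\n');  // second full copy, as strings
  fs.writeFileSync(outputPath, lines.map(mapLine).join('\n'));
}

// Tiny demo on a temporary file (paths are placeholders).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'buffered-'));
const input = path.join(dir, 'in.csv');
const output = path.join(dir, 'out.csv');
fs.writeFileSync(input, 'a,b\nc,d');
processBuffered(input, output, (line) => line.toUpperCase());
console.log(fs.readFileSync(output, 'utf8'));
```

With a few kilobytes this is harmless. With a file larger than the function's memory it fails before the first byte is written.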

Problem

Even though AWS Lambda offered up to 3 GB of memory at that time, it was not enough. Text files (e.g. a big CSV file or a lot of HTML files) are highly compressible, so the extracted content is sometimes 100x bigger than the zip file.
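The effect is easy to check on your own data. The sketch below builds a repetitive CSV in memory and compresses it with DEFLATE, the algorithm most zip files use. The rows are made up and the ratio depends heavily on the content, so treat the printed number as an illustration only.

```javascript
const zlib = require('node:zlib');

// Build a repetitive CSV in memory (rows are invented for the demo).
const rows = [];
for (let i = 0; i < 50000; i++) {
  rows.push(`${i},John Smith,555-0100,john.smith@example.com`);
}
const raw = Buffer.from(rows.join('\n'), 'utf8');

// Raw DEFLATE, without the zip container around it.
const compressed = zlib.deflateRawSync(raw);
const ratio = raw.length / compressed.length;

console.log(
  `raw: ${raw.length} B, compressed: ${compressed.length} B, ratio: ${ratio.toFixed(1)}x`
);
```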

One day the worst-case scenario happened: one of our users reported that deploys were not working for him. After a quick investigation it turned out that the Lambda responsible for unzipping was throwing an out of memory error. It had to be dealt with somehow, even if it was an edge case. That particular user did push the limits of deploying by sending a lot of images, videos and PDFs, but since I already knew streams existed, I decided to try to fix it.

Solution

I went back to the drawing board and did some research on streams. I couldn’t quite believe all the praise in blog posts, so I prepared my own data and code to test how much memory is used with buffers and with streams.

I had to rewrite the Lambda, but the code was so ugly that I never open sourced it. Hopefully one day I will have time to get back to it. The more important thing was: it worked.

I prepared a script that generates big CSV files using Faker.js ^[2]. By changing one variable I can decide how many rows it will generate. I chose a CSV file for the tests because it compresses very well, so I didn’t have to generate a very big zip file to simulate the memory usage of uncompressing.
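A generator along these lines can itself be written with streams, so that it never holds the whole file in memory. The sketch below is an assumption about the shape of such a script, not the one from the repository. It swaps Faker.js for hand-rolled placeholder values to stay dependency-free, and the column names are invented.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { Readable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// One variable controls the size of the generated file.
const ROWS = 1000;

// Lazily yield CSV lines; nothing is accumulated in memory.
function* csvLines(rowCount) {
  yield 'id,name,phone,email\n';
  for (let i = 1; i <= rowCount; i++) {
    // Placeholder values; a real script could call Faker.js here instead.
    yield `${i},User ${i},555${String(i).padStart(7, '0')},user${i}@example.com\n`;
  }
}

// pipeline() respects backpressure and cleans up the file stream on error.
async function generateCsv(filePath, rowCount) {
  await pipeline(Readable.from(csvLines(rowCount)), fs.createWriteStream(filePath));
}

// Demo: write to a temporary directory (path is a placeholder).
const target = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'csv-')), 'data.csv');
const done = generateCsv(target, ROWS).then(() => {
  console.log(`wrote ${target}`);
});
```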

Experiment

To make the experiment more “real-life like”, the processing consists of removing some fields and replacing some text (formatting phone numbers).
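As an illustration of that kind of per-row work, here is a small sketch. The column layout (`id,name,phone,email,notes`), the dropped field and the phone format are all assumptions made for the example; the repository may use different ones.

```javascript
// Hypothetical column layout: id,name,phone,email,notes
// The sketch drops "notes" and rewrites the phone number as XXX-XXX-XXXX.
function formatPhone(phone) {
  const digits = phone.replace(/\D/g, '');
  if (digits.length !== 10) return phone; // leave unexpected shapes untouched
  return `${digits.slice(0, 3)}-${digits.slice(3, 6)}-${digits.slice(6)}`;
}

function transformRow(line) {
  // Naive split: assumes no quoted fields containing commas.
  const [id, name, phone = '', email] = line.split(',');
  return [id, name, formatPhone(phone), email].join(',');
}

console.log(transformRow('1,Jane Doe,(555) 010 9999,jane@example.com,some note'));
```

Because the function only ever sees one line, it works the same whether the lines come from a Buffer split in memory or from a stream.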

To handle stream processing I used the excellent highland library. In the production code it was not needed, because AWS S3 supports streaming both from and to a bucket.
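The same kind of pipeline can be sketched with Node's built-in `stream` module instead of highland. That swap is made here only to avoid dependencies; the experiment itself uses highland. The names `lineMapper` and `processStreamed` are placeholders. The important detail is that a chunk can end in the middle of a line, so the unfinished part is carried over to the next chunk.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { Transform } = require('node:stream');
const { pipeline } = require('node:stream/promises');
const { StringDecoder } = require('node:string_decoder');

// Transform stream that applies mapLine to every complete line and keeps
// the trailing partial line until the next chunk arrives.
function lineMapper(mapLine) {
  const decoder = new StringDecoder('utf8'); // safe across split multi-byte chars
  let tail = '';
  return new Transform({
    transform(chunk, _encoding, callback) {
      const lines = (tail + decoder.write(chunk)).split('\n');
      tail = lines.pop(); // last piece may be an incomplete line
      const out = lines.map((line) => mapLine(line) + '\n').join('');
      if (out) this.push(out);
      callback();
    },
    flush(callback) {
      tail += decoder.end();
      if (tail) this.push(mapLine(tail));
      callback();
    },
  });
}

// Only a chunk at a time is held in memory, regardless of file size.
async function processStreamed(inputPath, outputPath, mapLine) {
  await pipeline(
    fs.createReadStream(inputPath),
    lineMapper(mapLine),
    fs.createWriteStream(outputPath)
  );
}

// Tiny demo on a temporary file (paths are placeholders).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'streamed-'));
const input = path.join(dir, 'in.csv');
const output = path.join(dir, 'out.csv');
fs.writeFileSync(input, 'a,b\nc,d\n');
const done = processStreamed(input, output, (line) => line.toUpperCase()).then(() => {
  console.log(fs.readFileSync(output, 'utf8'));
});
```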

Follow along

If you want to reproduce the test on your machine:

Clone the repository: git clone git@github.com:pavelloz/streams-vs-buffers.git
Install npm dependencies: npm i
Generate data: npm run generate-data
Run experiment: npm run experiment

Here are the results for 500000 rows on my local machine. The input file was 340 MB - 370M data.csv

Buffer

Memory used:  1403 MB
Operation completed after:  1370.11 ms

237M outputBuffer.csv

Streams

Memory used:  5 MB
Operation completed after:  1446.20 ms

236M outputStreams.csv

1403 MB vs 5 MB is a pretty big difference. And the best thing is, streams won’t take much more than that no matter how big the file is, while buffer memory usage grows with the size of the files.
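If you want to collect numbers like these for your own code, one possible approach is sketched below. It is an assumption, not necessarily how the repository measures: it samples the process's resident set size on a timer while the task runs, so the absolute values will not match a measurement based on heap usage.

```javascript
const { performance } = require('node:perf_hooks');

// Run an async task and report elapsed time plus the highest RSS seen while
// it ran. Sampling happens on a timer, so short spikes can be missed, and a
// fully synchronous task blocks the timer (only the final sample counts).
async function measure(task, sampleMs = 10) {
  let peakRss = process.memoryUsage().rss;
  const timer = setInterval(() => {
    peakRss = Math.max(peakRss, process.memoryUsage().rss);
  }, sampleMs);
  const start = performance.now();
  let elapsedMs;
  try {
    await task();
    elapsedMs = performance.now() - start;
  } finally {
    clearInterval(timer);
  }
  peakRss = Math.max(peakRss, process.memoryUsage().rss);
  return { elapsedMs, peakRssMb: peakRss / 1024 / 1024 };
}

// Demo task: allocate about 50 MB, then wait so the sampler can observe it.
const done = measure(async () => {
  const block = Buffer.alloc(50 * 1024 * 1024, 1);
  await new Promise((resolve) => setTimeout(resolve, 50));
  return block.length;
}).then((result) => {
  console.log(`Memory used: ${result.peakRssMb.toFixed(0)} MB`);
  console.log(`Operation completed after: ${result.elapsedMs.toFixed(2)} ms`);
  return result;
});
```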

I’m not entirely sure why the sizes of the generated CSV files are not exactly the same. If you want to find the bug, I will be happy to accept a PR or a comment :)

Summary

Streams performed worse when it comes to execution time. I was not worried about that, because a slower deploy is much better than no deploy at all. Additionally, the difference was not substantial (about 76 ms in the run above). The low memory footprint allowed me to scale the Lambda down to 256 MB of memory, and it is still plenty fast for the 99th percentile of cases. The edge cases have to wait a little bit longer, but at least it’s not throwing errors anymore, which is a success in my book.

I highly encourage you to experiment with streams, even if the first time you use them turns out to be your last. They are different from other concepts I have come across in programming, and when they finally clicked, the satisfaction was one of a kind.

Footnotes

[1] https://aws.amazon.com/lambda/pricing/
[2] https://github.com/faker-js/faker