Behind the data - powering Kwippe's search and art

2017-06-25

development

data, flat buffers, cbor, json, serverless

stacey reiman

One thing I knew when developing Kwippe was that our art had to pull up blazingly fast, or users would click off. Also, search results needed to come back near-instantaneously, with the most relevant artwork possible. In addition, we needed to develop related keywords and categories for thousands upon thousands of images. All of this had to be done by just 1 person: me.

Scary.

Ah, and I forgot to add that this needed to be done on a total shoestring budget (as in, for less than the price of a good pair of running shoes each month). So I began researching different platforms for delivering search results to an app - as well as pricing plans for databases, REST API services, and any other services that claimed to help serve up your data in some kind of automagical, lightning-fast format.

I quickly saw that Firebase - my authentication provider and first stab at database and storage - could become hideously expensive, very quickly. $1 per GB of data? Aside from being 10 times the going rate, it’s a bad omen for how fast your data will be served once you really start gathering steam with tons of simultaneous users. And nobody works this hard to launch the world’s next greatest app in order to grow slowly! While I still like Firebase for authentication and basic user info, I knew it wouldn’t work for serving up my data - on top of which, there’d be no way for that to work offline or in a mobile app.

So after trying out 4 or 5 other database/REST API services, I began wondering if maybe I couldn’t come up with something better - a system where users could access ONLY the data they needed, through the smallest chunks of data possible, delivered as fast as possible. As I began reading about binary versus text-based data, and how much more efficient and fast it is to serve binary files to users, I realized that my dream database may, in fact, be no database at all.

What! You must be crazy! An app with tens of thousands - and eventually hundreds of thousands - of images and keywords - run with no database at all? Yup. That’s what I’m telling you. File system only, baby. And here’s how I did it.

First - I ditched the idea that needing JSON in the app meant I had to store the data as JSON. I looked at everything: Protocol Buffers, FlatBuffers, MessagePack, and a couple of others. But frankly the documentation for most of that stuff in JavaScript was pretty terrible, and the main thing I took away from researching all of this was that BINARY was key. Put your stuff into binary and you’re way better off. Ok. I chose CBOR for my binary format, just because it was super easy to deal with, the documentation was excellent, and I could easily encode the data into CBOR via Node.js, then decode it client side, no problema. But that was just the first step to database-free bliss.
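To make the "binary beats text" point concrete, here is a minimal sketch of what CBOR encoding does to a JSON-style record. It is a hand-rolled encoder for a small subset of CBOR (non-negative integers, strings, arrays, plain objects), written only for illustration; it is not the library used on Kwippe, and a real project would use a maintained CBOR package. The `record` object and its field names are made up.

```javascript
// Minimal CBOR encoder for a subset of JSON: non-negative integers,
// strings, arrays, and plain objects. Illustration only.

// Build the header: major type in the top 3 bits, then the length/value.
function head(major, n) {
  const m = major << 5;
  if (n < 24) return Buffer.from([m | n]);
  if (n < 0x100) return Buffer.from([m | 24, n]);
  if (n < 0x10000) {
    const b = Buffer.alloc(3);
    b[0] = m | 25;
    b.writeUInt16BE(n, 1);
    return b;
  }
  if (n <= 0xffffffff) {
    const b = Buffer.alloc(5);
    b[0] = m | 26;
    b.writeUInt32BE(n, 1);
    return b;
  }
  throw new RangeError('value too large for this sketch');
}

function encode(value) {
  // Major type 0: unsigned integer.
  if (Number.isInteger(value) && value >= 0) return head(0, value);
  // Major type 3: UTF-8 text string, prefixed with its byte length.
  if (typeof value === 'string') {
    const utf8 = Buffer.from(value, 'utf8');
    return Buffer.concat([head(3, utf8.length), utf8]);
  }
  // Major type 4: array, prefixed with its item count.
  if (Array.isArray(value)) {
    return Buffer.concat([head(4, value.length), ...value.map(encode)]);
  }
  // Major type 5: map, prefixed with its pair count.
  if (value !== null && typeof value === 'object') {
    const entries = Object.entries(value);
    const parts = entries.flatMap(([k, v]) => [encode(k), encode(v)]);
    return Buffer.concat([head(5, entries.length), ...parts]);
  }
  throw new TypeError('unsupported type in this sketch');
}

// Hypothetical artwork record; the field names are invented.
const record = { id: 1042, title: 'Red Fox', tags: ['fox', 'animal', 'red'] };
const asJson = Buffer.from(JSON.stringify(record), 'utf8');
const asCbor = encode(record);
console.log(`JSON: ${asJson.length} bytes, CBOR: ${asCbor.length} bytes`);
```

The savings come from dropping the quotes, commas, and brackets that JSON needs and replacing them with one-byte headers, so the gain is biggest on records with lots of small fields.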

The 2nd piece was something I didn’t read about in any of the above resources - and it shrank my binary file sizes by more than 50%, in some cases much more. I used an awesome, uber simple string compression library to compress the strings before encoding to CBOR. I even convert arrays to strings before saving.
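Here is a rough sketch of that compress-first idea. The compression library isn't named above, so this swaps in Node's built-in `zlib` as a stand-in; the actual library, its output format, and the savings on real data will differ. The keyword list and the `|` delimiter are assumptions made for the example.

```javascript
const zlib = require('zlib');

// Hypothetical keyword list; real data would come from the artwork catalog.
const keywords = Array.from(
  { length: 200 },
  (_, i) => `watercolor landscape ${i % 20}`
);

// Convert the array to one delimited string, then compress that string.
const joined = keywords.join('|');
const raw = Buffer.from(joined, 'utf8');
const packed = zlib.deflateRawSync(raw);

// Reading it back: inflate, then split on the same delimiter.
const restored = zlib.inflateRawSync(packed).toString('utf8').split('|');

console.log(`raw: ${raw.length} bytes, packed: ${packed.length} bytes`);
console.log('round trip ok:', restored.length === keywords.length);
```

Joining the array first matters because compressors work best on one long run of text with repeated substrings, which is exactly what a list of related keywords looks like. The compressed bytes can then be stored as a single binary value inside the encoded file.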

Then it was just a matter of developing a rational system for placing my art files, as well as keyword files, into folders, and creating an index file for each folder - and I was ready to serve up both images and pre-cached keyword searches, all from the file system. An added benefit is that all of the artwork files, search terms, and indexes get cached by your browser - so unless I update the files, the user doesn’t have to request that data again. This kind of codeless cache management is certainly bliss to this programmer, as managing all of that programmatically is just 1 more headache I don’t need.
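The exact folder scheme isn't spelled out here, so the following is only one way such a layout could look: one folder per first letter, one file per keyword holding the matching artwork ids, and an index file per folder. The names `writeTree` and `lookup`, the sample data, and the use of plain JSON files (to keep the sketch short) are all assumptions.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical precomputed search results: keyword -> matching artwork ids.
const results = {
  fox: [1042, 1187],
  forest: [1042, 2210, 2305],
  river: [2210],
};

// Assumed layout: <root>/<first letter>/<keyword>.json,
// plus <root>/<first letter>/index.json listing that folder's keywords.
function writeTree(root, data) {
  const index = {};
  for (const [keyword, ids] of Object.entries(data)) {
    const letter = keyword[0];
    const folder = path.join(root, letter);
    fs.mkdirSync(folder, { recursive: true });
    fs.writeFileSync(path.join(folder, `${keyword}.json`), JSON.stringify(ids));
    if (!index[letter]) index[letter] = [];
    index[letter].push(keyword);
  }
  for (const [letter, words] of Object.entries(index)) {
    const file = path.join(root, letter, 'index.json');
    fs.writeFileSync(file, JSON.stringify(words.sort()));
  }
}

// A "search" is then just a file read (or an HTTP GET of the same path).
function lookup(root, keyword) {
  const file = path.join(root, keyword[0], `${keyword}.json`);
  return fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : [];
}

const root = fs.mkdtempSync(path.join(os.tmpdir(), 'art-'));
writeTree(root, results);
console.log(lookup(root, 'forest'));
```

Because every search result lives at a predictable path, any static file host can serve it, and the browser's normal HTTP cache handles repeat requests with no extra code.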

Now you may be asking “ok but how the heck do you generate all this crap!” Simple: with Node.js, plus 2 database systems on my end - the gorgeous and magical RethinkDB on my local machine, as well as good ole Node Dirty for temporary operations. So RethinkDB houses all of our stuff, and is used to regenerate our keyword and related keyword definitions, which do have to be redone whenever you add artwork. I realize that isn’t the best system - but there isn’t really any way around it if you want to use a database-free system.

That part could easily be done each night by a cron job.

As for getting related keywords for the images - I developed a few scripts in Node.js to query word API services that return beautiful lists of synonyms, antonyms, and even categories for each term entered. I then match those up against terms that actually exist in our DB, so the user isn’t offered a word that we don’t actually have artwork for.
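The matching step can be sketched in a few lines. Everything named here is hypothetical: `catalogTerms` stands in for the terms that have artwork, `suggestions` stands in for whatever a word API returns, and `relatedKeywords` is just an illustrative helper.

```javascript
// Hypothetical: terms that have at least one artwork in the catalog.
const catalogTerms = new Set(['fox', 'wolf', 'forest', 'woodland']);

// Hypothetical word-API response for the term "forest".
const suggestions = ['woods', 'woodland', 'Forest', 'jungle', 'timberland'];

// Keep only suggestions that exist in the catalog, normalized to lowercase,
// without duplicates and without the search term itself.
function relatedKeywords(term, candidates, known) {
  const kept = new Set();
  for (const candidate of candidates) {
    const word = candidate.trim().toLowerCase();
    if (word !== term && known.has(word)) kept.add(word);
  }
  return [...kept];
}

console.log(relatedKeywords('forest', suggestions, catalogTerms));
// prints [ 'woodland' ]
```

Using a `Set` for the catalog keeps each check cheap, which matters when the same filter runs over every keyword during a nightly rebuild.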

While I don’t want to give away all of my secret sauce here - I can say that the art, keywords and related keywords are loading faster than I ever imagined, and I like to think that users will notice the difference.