Fear and loathing in bike share
Storing historics of bike share information has been a running topic for me.
The first time I tried, I failed miserably. I was using MongoDB as my primary store for the real time information. I added a new collection called “stats” and nonchalantly started throwing data at it. It was the time of hype and webscale things, which together with my own naivete made me think the tool would solve the problem for me. I left it running for some time, letting the db get heavier and heavier, whilst I moved on to writing a sample endpoint to query this “stats” information. I excitedly shared this endpoint with some interested people from the internet and left it at that.
Queries ran acceptably for a while until they started crashing my system, which was running on a low-spec VM. Fine, I decided to ignore the query endpoint and just focus on archiving the information somehow. The database kept growing until it filled all the available space on the machine. Now what? I downloaded a copy of the 70ish GB database, rm’d it from the server and started over. The next time it happened I just disabled the flag for the stats feature and moved on.
No matter what I tried to do with this big fat mongo database, I failed. 70 GB is far from being considered “big data”, but at the time my laptop was not a stallion, my servers were among the cheapest available, and my internet connection sucked. I know, excuses. The closest I got to extracting an export out of it was by using free GCP credits and firing up a machine with more RAM and storage. And then I decided to just forget about it and let the problem linger for years.
This is a post about my takeaways from that first attempt.
No silver bullet
Be wary of magic tools. They can help (or make it worse), but do not magically solve your problems. For MongoDB in particular, I gave up on trying to make something out of it. The query language felt alienating and severely limited the effectiveness of the time I invested and the features I could add.
And let’s be fair. I didn’t fail because I decided to go with the most hyped tool of the time or because I didn’t understand non-relational schemas. I failed because I didn’t have a plan, and because I didn’t understand the problem I was trying to solve. Once things started breaking down the only option available for me was to back off.
I wanted to try again, but the way things had unfolded left me with a sour taste for experimentation. I stubbornly decided not to work on the problem until I had:
- A system I could plug in and out of the project without affecting the reliability of the whole system.
- A system I could host myself. Sending the problem away to a hosted provider does not effectively solve it for me and I do not want to be at the mercy of the next invoice. Once you are in and things are running, there’s a hard path out of it, too.
- A plan for resources. Even if my first experiment was suboptimal, it showed that whatever store I used would grow considerably over time. RAM is expensive, storage is expensive. I needed a plan that said how much it would cost me to run and maintain this for the next N years.
- A plan for funding. CityBikes is my side project and it costs both time and money to run. Unless it becomes ‘a project’, it does not make sense to even think about historics, given all the previous points. QED.
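The resource plan from the list above boils down to back-of-envelope arithmetic. Here is a minimal sketch in Python of what such an estimate could look like. Every number in it (station count, polling interval, bytes per sample, price per GB) is a made-up assumption for illustration, not a real CityBikes figure, and the function names are hypothetical.

```python
def estimate_storage_gb(stations, poll_interval_s, bytes_per_sample, years):
    """Raw, uncompressed size (in GB) of keeping every station sample."""
    seconds = years * 365 * 24 * 3600
    samples_per_station = seconds // poll_interval_s
    return stations * samples_per_station * bytes_per_sample / 1e9


def estimate_cost(storage_gb, price_per_gb_month, years):
    """Pessimistic bound: pay for the final size during the whole period."""
    return storage_gb * price_per_gb_month * 12 * years


# Hypothetical inputs, chosen only to show the shape of the calculation.
size_gb = estimate_storage_gb(
    stations=10_000, poll_interval_s=300, bytes_per_sample=100, years=1
)
cost = estimate_cost(size_gb, price_per_gb_month=0.05, years=1)
print(f"{size_gb:.2f} GB, about {cost:.2f} per year")
```

Even a rough model like this makes it obvious which knobs matter: halving the polling interval doubles the storage, and the bill grows with every year of data you decide to keep.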
And time went by …
… and my project entered maintenance mode the moment I ran out of time and energy. Luckily there were active contributors on pybikes that kept the project updated, and my sole responsibility was to merge code and keep the systems monitored and running. My primary focus was trying to shave hosting costs and automating the tedious parts of running the system – way too much time invested in setting up rate limiting and key authentication to keep the API under fair usage.
From time to time I would get an email asking me if I had historics for bike-share data. Friends asked too. Every time it felt like a punch to my stomach. “IT’S NOT SO EASY OK?”, but I kept feeling I was overcomplicating things, and what’s worse, that everyone thought I was overcomplicating things.
A plan for funding
I started my career as a software developer, leaving the project aside as a background process. Still, most of my roles have been tangentially related to Citybikes. At Scrapinghub I was writing spiders, parsers and systems to extract information, much in the same vein as Citybikes. At Kong I was writing Lua for the gateway, which was also a tool I was using in Citybikes. When they were called Mashape, they had an integration for Citybikes too, so it was all coming full circle. Secretly, I thought, I was just learning how to write better software and somehow “getting funded” for keeping the project running.
I have tried at least two times to take a break and focus on Citybikes and it has never really worked out for me. I had the funds, and yet the project was not funded per se, so I would try, and burn out easily.
The last straw was between 2019 and 2021 when I tried to negotiate a contract with one of the companies using the Citybikes API commercially for their product. To me it made sense that they would compensate my efforts for running the project for them. After rounds of going back and forth with an SLA that was getting nowhere, I realized I had already spent more on that deal than I would ever get out of it, so I shut down all negotiations and blocked them from using the project. I came out of that experience taxed both financially and mentally, and I was very close to just shutting down the project.
To this day, their systems are still trying to hit the API and getting 401d and I am still not very sure what’s going on, really. My only guess is that they have forgotten two rogue machines running forever. I also suspect they have another process running their requests through a proxy service provider, since there is a suspicious number of IPs with the same number of requests and a pattern that correlates with their two machines. To me it’s funny to think that they could be running their own instance of Citybikes if they wanted, and yet they decide not to.
Nowadays I try not to think too much about this (I have a button to filter them out from grafana) and instead focus on all the cool things people are building using the Citybikes API. The following are HomeAssistant instances hitting the API. This metric makes me smile, and so I reach for it whenever I need it.
[Image: HomeAssistant instances hitting the API]
FOSDEM ’24
I visited the NLnet dev room at FOSDEM ’24, talked with them about all of this, and sent a proposal the next week. 8 months and some emails later I had a green light for my grant proposal. And guess what, one of the milestones I included in my proposal was Historical Data, under the following description:
Real time information is immediately useful for displaying the location and status of stations. But interesting insights about bike sharing systems are lost if this information is not stored over time. Time series information about bike sharing systems is specially interesting for research purposes and will broaden the impact that Citybikes as a project can do to society by making it freely available for non-commercial purposes.
For me, this meant conquering my fears and trying again where I had failed before. And that’s what I will write about next.