Tuesday, October 7, 2014

Common distributed systems implementation practices at different companies



Reflecting on a number of projects I have worked on or encountered at different companies, I want to share a few observations about how distributed systems are actually being used:

  •     A lot of companies are still at an early stage with Hadoop and distributed systems. The main use case is to supplement or replace more expensive systems like Teradata or Oracle. A common trend is to start doing data science (roughly 80% of those cases being recommendation engines; see the first sketch after this list).
  •     Kafka has become the commonly implemented system for retrieving logs from a live application and ingesting them into Hadoop, ahead of Flume or other messaging systems like RabbitMQ these days. I believe this is thanks to Kafka's relative ease of deployment and its reputation for scalability and fast ingest times (see the log-shipping sketch after this list).
  •     The demand for resource managers in distributed systems, while generating a lot of hype, is almost nonexistent! Most small/medium-sized companies just don't want to bother adding an extra layer like Mesos on top of their cluster to handle application/user management, and prefer to run a separate cluster for each usage: one for production, one for development, etc. Some startups even go further and use a cluster per engineer! Why? Because these companies typically don't have that much data to wrestle with, and the cost of spinning up a cluster dynamically on AWS is very low. Even bigger companies do a lot of experimentation: Apple disclosed recently at the 2014 Cassandra Summit that they run upwards of 75k Cassandra nodes, spread across a multitude of clusters. So resource management across applications is generally a low priority in practice; in the Hadoop world, it is usually limited to disk quotas and the out-of-the-box priority queueing system configured per user group (see the scheduler configuration sketch after this list).
  •     Companies generally don't have a well-thought-out big data architecture, and are willing to try different tools and live with a hodge-podge of them; for example, Airbnb runs both Hive and Presto concurrently on the same data, and Twitter has a fair number of Scalding users alongside Spark users.
  •     Avro is the data format most commonly used by small/medium companies, thanks to its support for schema evolution and multiple languages (see the schema-evolution sketch after this list). Parquet is recognized as the newcomer, and is thought of as more performant and more cross-platform than its nemesis, ORC, which really only supports Hive. Also of note is the decline of Java in favor of Python in the data world.
  •     Spark is slowly taking over as the distributed framework for big data, as people realize that Hadoop's APIs are just too low-level, and that Cascading's APIs, while good, are frankly not as well designed as Spark's (see the word-count sketch after this list). Also, regarding resource management, Databricks/Spark is rumored to be getting its own resource management tool soon, which would remove the need for YARN in a Spark-only environment.
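
Since recommendation engines come up so often, here is a minimal sketch of one using Spark MLlib's ALS (the Spark 1.x Scala API). The input path and the rank/iterations/lambda parameters are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    object RecommendDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("als-demo"))

        // Hypothetical input: one "userId,productId,rating" triple per line.
        val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
          val Array(user, product, rating) = line.split(',')
          Rating(user.toInt, product.toInt, rating.toDouble)
        }

        // Train a matrix factorization model: rank 10, 10 iterations, lambda 0.01.
        val model = ALS.train(ratings, 10, 10, 0.01)

        // Print the top 5 product recommendations for a given user.
        model.recommendProducts(42, 5).foreach(println)

        sc.stop()
      }
    }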
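
And here is a sketch of the kind of log shipping Kafka gets used for, written from Scala against Kafka's Java producer client. The broker list, topic name, and log path are hypothetical; on the Hadoop side, a consumer such as Camus would land the messages into HDFS:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object LogShipper {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092,broker2:9092")  // hypothetical brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)

        // Ship each application log line to the "app-logs" topic.
        scala.io.Source.fromFile("/var/log/app/app.log").getLines().foreach { line =>
          producer.send(new ProducerRecord[String, String]("app-logs", line))
        }
        producer.close()
      }
    }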
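
As for the out-of-the-box queueing I mentioned, this is roughly what it looks like in YARN's capacity-scheduler.xml; the queue names and capacity percentages below are made up:

    <configuration>
      <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>prod,dev</value>
      </property>
      <property>
        <!-- 70% of cluster resources guaranteed to the prod queue -->
        <name>yarn.scheduler.capacity.root.prod.capacity</name>
        <value>70</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.dev.capacity</name>
        <value>30</value>
      </property>
    </configuration>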
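
To make the schema evolution point concrete, here is a small self-contained sketch using Avro's generic API: a record serialized with an old schema (v1) is read back with a newer one (v2) that adds a field with a default value. The Event record and its fields are invented for the example:

    import java.io.ByteArrayOutputStream
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}

    object AvroEvolutionDemo {
      val v1 = new Schema.Parser().parse(
        """{"type":"record","name":"Event","fields":[
          |  {"name":"id","type":"long"}]}""".stripMargin)
      val v2 = new Schema.Parser().parse(
        """{"type":"record","name":"Event","fields":[
          |  {"name":"id","type":"long"},
          |  {"name":"source","type":"string","default":"unknown"}]}""".stripMargin)

      def main(args: Array[String]): Unit = {
        // Serialize a record with the old (v1) schema...
        val record = new GenericData.Record(v1)
        record.put("id", 42L)
        val out = new ByteArrayOutputStream()
        val encoder = EncoderFactory.get().binaryEncoder(out, null)
        new GenericDatumWriter[GenericRecord](v1).write(record, encoder)
        encoder.flush()

        // ...and read it back with the new (v2) schema: Avro fills in the default.
        val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
        val evolved = new GenericDatumReader[GenericRecord](v1, v2).read(null, decoder)
        println(evolved) // {"id": 42, "source": "unknown"}
      }
    }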
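
Finally, to illustrate the difference in API level, here is the classic word count in Spark; the same job in raw Hadoop MapReduce takes a Mapper class, a Reducer class, and a page of driver boilerplate. The paths are made up:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("wordcount"))
        sc.textFile("hdfs:///data/input")
          .flatMap(_.split("\\s+"))   // split each line into words
          .map(word => (word, 1))     // pair each word with a count of 1
          .reduceByKey(_ + _)         // sum the counts per word
          .saveAsTextFile("hdfs:///data/output")
        sc.stop()
      }
    }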


