– Blog by Jesse Anderson
In the beginning, there was Google. Google looked over the expanse of the growing internet and realized they’d need scalable systems. They created GFS and MapReduce, publishing the GFS paper in 2003 and the MapReduce paper in 2004.
Doug Cutting took those papers and created Apache Hadoop in 2005.
Cloudera was started in 2008, and Hortonworks started in 2011. They were the first companies to commercialize open source big data technologies, and they drove the marketing and adoption of Hadoop.
Hadoop was hard to program, and Apache Hive came along in 2010 to add SQL. Apache Pig came in 2008, too, but it never saw as much adoption.
With an immutable file system like HDFS, we needed scalable databases to read and write data randomly. Apache HBase came in 2007, and Apache Cassandra came in 2008. Along the way, there were explosions of new databases of various types: GPU, graph, JSON, column-oriented, MPP, and key-value.
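To make the need concrete, here is a minimal sketch of random reads and writes against HBase using the happybase Python client. The localhost Thrift server, the users table, and the profile column family are all hypothetical names for illustration, not anything from this history:

```python
import happybase

# Connect to a local HBase Thrift server (hypothetical host and table names).
connection = happybase.Connection("localhost")
table = connection.table("users")

# Random write: update a single row by key, with no rewriting of immutable files.
table.put(b"user-42", {b"profile:name": b"Ada", b"profile:plan": b"pro"})

# Random read: fetch that single row back by key.
row = table.row(b"user-42")
print(row[b"profile:name"])
```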
Hadoop didn’t support doing things in real time, and Apache Storm was open sourced in 2011. It didn’t get wide adoption, as it was a bit early for real-time and the API was difficult to wield.
Apache Spark came in 2009 and gave us a unified batch and streaming engine. Its usage grew, and it eventually displaced Hadoop MapReduce.
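As a rough illustration of that unification, here is a PySpark sketch: the same DataFrame operations can run as a batch job over files on disk or as a structured streaming job over files as they arrive. The paths, the user_id column, and the events layout are assumptions for the example, not anything from the original post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

# Batch: read a static directory of JSON events (hypothetical path and schema).
batch_df = spark.read.json("events/")
batch_df.groupBy("user_id").count().show()

# Streaming: the same groupBy/count over files arriving in another directory.
stream_df = (spark.readStream
             .schema(batch_df.schema)  # reuse the schema inferred in batch
             .json("incoming-events/"))
query = (stream_df.groupBy("user_id").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```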
Apache Flink came in 2011 and gave us our first real streaming engine. It handled the stateful problems of real-time processing elegantly.
We lacked a scalable pub/sub system. Apache Kafka came in 2011 and gave the industry a much better way to move real-time data. Apache Kafka has its architectural limitations, though, and Apache Pulsar was released in 2016 to address some of them.
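For readers who haven’t used it, here is a minimal pub/sub sketch with the kafka-python client; the localhost:9092 broker, the events topic, and the message contents are hypothetical:

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish: producers append messages to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"user_id": 42, "action": "click"}')
producer.flush()

# Subscribe: consumers in separate groups each read the full stream
# independently, which is what makes it pub/sub rather than a queue.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="sketch-group",
)
for message in consumer:
    print(message.value)
```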
The first big conferences were Strata and Hadoop World, which eventually merged in 2012. It was the place where the brightest big data minds came and spoke. It was shepherded well by Ben Lorica, Doug Cutting, and Alistair Croll.
There was (and still is) an overall problem in the industry: most projects failed to get into production. Some people blamed the technologies, but the technologies more or less worked well. Big data projects were given to data scientists and data warehouse teams, where the projects subsequently failed. As obvious as that sounds now, my writing about the need for data engineering went heavily against the grain of everything written at the time.
DJ Patil coined the term Data Scientist in 2008. For the majority of companies, that was the only title working on data problems at scale. Honorable mentions to Paco Nathan, John Thompson, and Tom Davenport, who wrote about data science and analytics team management.
Google’s 2015 paper Hidden Technical Debt in Machine Learning Systems highlighted the fact that machine learning isn’t just the creation of models. It is predominantly data engineering and all of the technical debt difficulties that come with data.
I started to write about the management side of big data in 2016 by talking about how data engineering is more difficult than other industry trends. I further expanded on these ideas in 2017 by talking about complexity in big data and writing my first book, Data Engineering Teams. I continued to help people understand the need for data engineers in 2018 by discussing the differences between data scientists and data engineers. I followed that post up in 2019 by showing that data scientists are not data engineers. In 2020, I published my third book, Data Teams, to expand on how data teams and the business need to cooperate. To share even more best practices and knowledge, I started the Data Dream Team podcast in 2021.
Maxime Beauchemin was writing about data engineering in 2017 too. He wrote The Rise of the Data Engineer, showing how the industry was changing. He followed it up later that year with The Downfall of the Data Engineer to talk about the growing pains of data engineering.
Zhamak Dehghani first introduced data mesh in 2019 as a sociotechnical approach to data. She wrote Data Mesh in 2022 to provide more information about the subject.
Gene Kim talks about the management of data teams in The Unicorn Project, which was published in 2019.
The programming language du jour has changed over the years. At various times it’s been Java, Scala, and Python. Now people are excited about Rust. Large, untyped codebases are landmines in an industry that deals with data.
This brief history leaves out many technologies and companies. Over time, some have died, some are dying a slow death, some are still trying to find their footing, and others are moving along nicely. Making poor technology choices can lead to late-game failure.
People who don’t know their history are doomed to repeat it. Many data engineers are new and don’t understand the history or the technologies they’re using. There is still a focus on technology and programming languages as the main drivers of success or failure. However, people and organizational structure are the primary drivers of the early success or failure of data projects.
Looking at the technological improvements over the years, we have better tools, but they didn’t make the problems easy. None of them took a really hard problem and made it so easy anyone could do it. The gains were 5 to 10 percent improvements in ease, where more time could be spent on business problems because the solution was built in rather than custom written. I firmly believe that no general-purpose distributed system will make data engineering easy. There isn’t going to be the equivalent of a WordPress moment, where the bar lowers dramatically.