Brief History of Data Engineering

– Blog by Jesse Anderson

In the beginning, there was Google. Google looked over the expanse of the growing internet and realized they’d need scalable systems. They created GFS and MapReduce, publishing the GFS paper in 2003 and the MapReduce paper in 2004.

Doug Cutting took those papers and created Apache Hadoop in 2005.

Cloudera was started in 2008, and Hortonworks followed in 2011. They were the first companies to commercialize open source big data technologies, and they drove the marketing and adoption of Hadoop.

Hadoop was hard to program, so Apache Hive came along in 2010 to add SQL. Apache Pig arrived in 2008 as well, but it never saw as much adoption.

With an immutable file system like HDFS, we needed scalable databases to read and write data randomly. Apache HBase came in 2007, and Apache Cassandra came in 2008. Along the way, there were explosions of databases within each type, such as GPU, graph, JSON, column-oriented, MPP, and key-value.

Hadoop didn’t support real-time processing, so Apache Storm was open sourced in 2011. It didn’t get wide adoption: it was a bit early for real-time, and the API was difficult to wield.

Apache Spark came in 2009 and gave a unified batch and streaming engine. It gained in usage and eventually displaced Hadoop.

Apache Flink came in 2011 and gave us our first real streaming engine. It handled the stateful problems of real-time elegantly.

We lacked a scalable pub/sub system. Apache Kafka came in 2011 and gave the industry a much better way to move real-time data. Apache Kafka has its architectural limitations, and Apache Pulsar was released in 2016.

The first big conferences were Strata and Hadoop World, which merged in 2012. It was the place where the brightest big data minds came and spoke. It was shepherded well by Ben Lorica, Doug Cutting, and Alistair Croll.

There was (and still is) an overall problem in the industry because most projects failed to get into production. Some people blamed the technologies, but the technologies more or less work well. Big data projects were given to data scientists and data warehouse teams, where the projects subsequently failed. As evident as that sounds now, my writing about the need for data engineering went heavily against the grain of everything written at the time.

DJ Patil coined the term Data Scientist in 2008. For the majority of companies, that was the only title working on data problems at scale. Honorable mentions to Paco Nathan, John Thompson, and Tom Davenport who wrote about data science and analytic team management.

Google’s 2015 paper Hidden Technical Debt in Machine Learning Systems highlighted the fact that machine learning isn’t just the creation of models. It is predominantly data engineering and all of the technical debt difficulties that come with data.

I started to write about the management side of big data in 2016 by talking about how data engineering is more difficult than other industry trends. I further expanded on these ideas in 2017 by talking about complexity in big data and writing my first book Data Engineering Teams. I continued to help people understand the need for data engineers in 2018 by discussing the differences between data scientists and data engineers. I followed that post up in 2019 by showing that data scientists are not data engineers. In 2020, I published my third book Data Teams to expand on how data teams and business need to cooperate. To share even more best practices and knowledge, I started the Data Dream Team podcast in 2021.

Maxime Beauchemin was writing about data engineering in 2017 too. He wrote The Rise of the Data Engineer, showing how the industry was changing. He followed it up later that year with The Downfall of the Data Engineer to talk about the growing pains of data engineering.

Zhamak Dehghani first introduced data mesh in 2019 as a sociotechnical approach to data. She wrote Data Mesh in 2022 to provide more information about the subject.

Gene Kim talks about the management of data teams in The Unicorn Project, which was published in 2019.

The programming language du jour has changed over the years. At various times it’s been Java, Scala, and Python. Now people are excited about Rust. Large, untyped codebases are landmines in an industry that deals with data.

This brief history leaves out many technologies and companies. Over time, some have died, some are dying a slow death, some are still trying to find their footing, and others are moving along nicely. Making poor technology choices can make for a late-game failure.

People who don’t know their history are doomed to repeat it. Many data engineers are new and don’t understand the history or the technologies they’re using. There is still a focus on technology and programming languages as the main driver for success or failure. However, people and organizational structure are still the primary drivers for the early success or failure of data projects.

Looking at the technological improvements over the years, we have better tools, but they didn’t make problems easy. None of them took a really hard problem and made it so easy anyone could do it. The gains were 5 to 10 percent improvements in ease, where more time could be spent on business problems because the solution was built-in rather than custom written. I firmly believe that no general-purpose distributed system will make data engineering easy. There isn’t going to be the equivalent of a WordPress event where the bar lowers dramatically.

6 Key Benefits of Cloud Computing for Businesses

– Blog by

Cloud computing has had a transformative effect on businesses of all sizes. It has opened up new possibilities for smaller firms and allowed larger ones to be more reactive. Cloud technology is now an integral part of many people’s working lives, even if they don’t realize it. 

While many of us use cloud computing, there’s still a knowledge gap around the services it offers. Here are six ways cloud computing can help your business, from improving efficiency to making you more responsive to change. 

It Provides Data Redundancy

Most of us use cloud storage in a personal capacity, whether it’s backing up photos from our phones or storing documents in Google Drive. However, for businesses, a string of notable security breaches has raised concerns. For example, Apple’s iCloud has proven vulnerable to repeated hacking attempts. Plus, recent security breaches at sites such as Facebook and Twitch show that even the biggest providers remain vulnerable.

For many, the result of this reticence is a lack of data redundancy, leaving local data vulnerable to accidental deletion, hardware failure, or other losses. Cloud storage from a managed services provider offers an easy, affordable and secure way to back up your business’s data.

It Allows You To Be More Agile

The IT demands of businesses can be extremely changeable. Fluctuating workloads can create variable bandwidth requirements, with the huge capital investment needed to accommodate rare usage peaks. Meanwhile, new software can demand new infrastructure and skillsets for something that may be temporary. By the time you meet all requirements, you could find yourself lagging behind your competitors.

By taking advantage of cloud computing, you can quickly scale your investment up or down depending on demand. With a managed services provider’s ample expertise and capacity, you can increase bandwidth without investing in your own infrastructure. This allows you to launch services faster, experiment with new technologies, and trial new applications.

It Provides New Insights

On a local network, data is often atomized, with different versions of files spread across multiple workstations. By moving to cloud software solutions, the data you collect and produce is centralized. This makes it easier to organize and analyze your data and ensures that more of it is available to analyze.

With each action, transaction, and interaction logged in the Cloud, you’ll have access to a glut of actionable information. Data analytics software can help sift through this data, identifying patterns and trends and highlighting areas for improvement. With little change to your working practices, you can harness otherwise lost information and use it to improve your business practices.

It’s Ideal for Hybrid Work

Modern workplaces are increasingly collaborative spaces. Where once files would be stored locally and passed around, there’s an increasing expectation that they should be available in a centralized location. Rather than existing as multiple copies, documents should exist in a live state, tracking and merging changes from multiple users.

This is something that is only possible thanks to cloud computing. By making files and documents available within the Cloud, you can access, edit and download them from anywhere. This improves the organization of files and tracking of changes and enables new working patterns. Home workers can access their files remotely and pick up where they left off.

It Can Improve Security

As mentioned previously, there’s a lingering negative perception around data storage in the Cloud. Well-publicized data breaches have made it appear that data is more susceptible to unauthorized access when stored online rather than on your own devices. Yet this is a misapprehension of what cloud computing is and the relative benefits of cloud storage over local storage.

Your local network is generally open to remote access through a range of entry points and, when mismanaged, can often be far less secure than a cloud storage solution. A managed IT services provider offers advanced security options, as well as 24/7 monitoring and management of your cloud storage. This often means data stored in the Cloud is more secure than data on your own network.

It Increases Quality Control

The scattering of data across a decentralized system can have a huge impact on quality. Different people working on different copies of the same documents can lead to all sorts of problems, from struggling to discern versions and revisions to completely overwriting new documents with old ones. Also, progress on a document might be unknown to other employees, with files hidden away on individual workstations.

In the Cloud, files can be centralized in a single location, with tight version control and a log of revisions. This provides all employees with clear information about each file, including who changed it, when it was last changed, and which version or revision it is.

Sota is one of the UK’s leading independent providers of professional IT support in Kent, cloud computing, cyber resilience, connectivity, and unified communications. Having worked with countless businesses over the years, they are experts in their field, ready to advise and offer tailored solutions for each and every company.

Digital Transformation Investment at $3.4 Trillion

– Blog by David H. Deans

Business technology leadership matters. Across the globe, more leaders have been pursuing bold Digital Transformation (DX) initiatives with the goal of creating new sources of business value through digital products, services, and experiences.

As an additional benefit, the COVID-19 pandemic revealed that digital transformation efforts improve an organization’s resilience against global market disruptions.

Global DX investment is forecast to reach $3.4 trillion in 2026 with a five-year compound annual growth rate (CAGR) of 16.3 percent, according to the latest worldwide market study by International Data Corporation (IDC).

Digital Transformation Market Development

“Despite strong headwinds from global supply chain constraints, soaring inflation, political uncertainty, and an impending recession, investment in digital transformation is expected to remain robust,” said Craig Simpson, senior research manager at IDC.

The benefits of investing in DX technology — including automation, analytics intelligence, operational transparency, and direct support around customer or employee experience — all support targeted areas of business growth in an uncertain global economy.

The DX use case that will see the largest investments over the forecast period is Innovate, Scale, and Operate, a broad area covering large-scale operations, including making, building, and designing activities.

Core business functions that make up this area include supply chain management, engineering, design and research, operations, and manufacturing plant floor operations. Innovate, Scale, and Operate will account for more than 20 percent of all DX investments throughout the forecast.

According to the IDC assessment, the next largest use cases are Back-Office Support and Infrastructure at more than 15 percent of all DX spending and Customer Experience at more than 8 percent.

The fastest growing among the more than 300 DX use cases identified by IDC include Digital Twins and Robotic Process Automation-Based Claims Processing with five-year CAGRs of 35.2 percent and 31 percent respectively.

Nearly 30 percent of worldwide DX spending throughout the forecast period will come from the Discrete and Process Manufacturing industries, where Robotic Manufacturing, Autonomic Operations, and Self-Healing Assets and Augmented Maintenance are among the leading use cases.

The next largest industries for DX spending are Professional Services and Retail where Back-Office Support and Infrastructure is the leading DX use case.

The Securities and Investment Services industry will experience the fastest growth in DX spending with a five-year CAGR of 20.6 percent, followed closely by Banking and Healthcare Providers with CAGRs of 19.4 percent and 19.3 percent respectively.

Outlook for Digital Transformation Regional Growth

The United States will be the largest geographic market for DX spending throughout the forecast, accounting for nearly 35 percent of the worldwide total and surpassing the $1 trillion mark in 2025. 

Western Europe will be the second largest region with nearly a quarter of all DX spending. China will see the strongest growth in DX spending with a five-year CAGR of 18.6 percent, followed closely by Latin America with a CAGR of 18.2 percent.

The Asia-Pacific region is expected to grow at a double-digit rate across the forecast period, where use cases from the Internet of Things (IoT) and Robotics have high potential within the Manufacturing sector.

That said, I believe the shift to hybrid work and flexible working models will help to advance stable growth in digital transformation investments in the coming years of this amazing decade of transition.