A Short History of Data

“Data is the lifeblood of business” is a well-worn phrase. It’s certainly true that all businesses create and process data on a daily basis as part of normal operations, but the volume and diversity of information flowing through every type of organisation is greater than it has ever been.

Artificial Intelligence, as one current example, is driving demand for data at a scale where incremental gains in the accuracy of AI models are only seen with an order of magnitude more information. Businesses continually need more data to drive more insights and potential customer value.

In reality, data has always been at the centre of business. Before the widespread use of technology, businesses would record transactions in physical ledgers that were written by hand and rarely reviewed. As digital technology was introduced from the 1950s, all types of businesses and organisations became increasingly dependent on the storage and retrieval of electronic records, as this digital data offered fast and automated analysis and insights compared to human review. One very early example of a company spotting the potential of computing is J Lyons & Co and the LEO computer, which dates back to 1951 and is widely regarded as the first computer created for commercial business applications.

The first electronic computers developed in the 1940s were more arithmetic calculators than data processors. ENIAC had to be programmed manually for each task, and earlier machines performed similarly specialised, complex computations. The “Baby”, developed at the University of Manchester, was the first electronic stored-program computer. The stored-program concept underpins the Von Neumann architecture (and the related Harvard architecture) on which all modern computing systems are based (with the arguable exception of quantum computing).

The Von Neumann design consists of processing and memory functions, with data and code read from external input devices. Those external devices are what we would today consider persistent storage. Early computing platforms used magnetic tape for persistent storage (UNIVAC being the first in 1951), followed by magnetic disk introduced by IBM in 1956 (as part of the IBM 305 system). Today we have a wide range of persistent storage media that still includes tape and disk but is increasingly moving towards solid-state technology such as NAND flash. There are also new recording media, including resistive RAM and ceramic technologies, that aim to compete directly with the dominant flash technology.

Today, businesses of all types are almost 100% reliant on information technology to operate. Three decades ago, if computer systems went down, most businesses could revert to manual operations and catch up when those systems came back online.

Today, data is created from so many diverse sources (a topic we will discuss in a moment) that it can’t be generated manually. As a result, when computer systems are down, businesses aren’t operating. Computer systems and the data they hold have become essential for the operation of a modern business.

The Data Centre

The terms “data processing” and “data centre”, now part of our accepted lexicon, were coined in the 1960s and chosen with good reason. Computing devices at that time were big and expensive, requiring careful management and specific technical skills.
Distributed networking barely existed (ARPANET, the forerunner of the modern Internet, went live in 1969), so all business information was centralised on mainframe computing systems, essentially a single “centre” for the processing of data.

As networking has become ubiquitous and costs have plummeted, businesses have been able to provide remote computing capabilities, initially with “dumb” terminals, followed by distributed computing (now being rebranded as edge computing). Each of the last four decades has seen an expansion and diversification of computing infrastructure and solutions. The modern data centre is no longer a single place, but a virtual entity that is an aggregation of many disparate products, solutions and operating models.

Figure X shows the diversity of infrastructure and solutions used to deliver modern IT. Today, a typical business may have systems in an on-premises data centre, in the public cloud and at the edge, and may also use SaaS-based business tools. This mix can be fluid and the balance of data in each location may change over time. All of this data needs protecting in some way, because all of this information is critical to the operation of a modern enterprise.

Over the lifetime of a business, data is created, collected, and processed. In the early days of computing, most data was human-generated and stored in transactional databases. So-called “structured data” is now being overtaken by the creation of unstructured content, including information generated by both human and non-human sources (such as cameras, sensors or medical scanning equipment). All of this data has value to the business, either as part of day-to-day operations or for generating future value. Increasingly, unstructured data is being seen as a source for Artificial Intelligence, either to train large language models (LLMs) or to supplement LLM applications through the concept of retrieval augmented generation (RAG).

During the last decade, data has become a strategic asset for businesses. Traditional data warehouses have become data “lakes” that are mined using Artificial Intelligence software to generate inferences which can in turn be applied to new data entering the system. An AI model is “trained” on huge quantities of data, then applies that knowledge to new data in order to discover insights or other information that couldn’t easily be determined by humans. As AI models evolve, the volumes of data being processed now extend to multiple petabytes. Side note: the volume of data needed for LLMs is now so vast that companies are creating “synthetic” data to further improve the accuracy and capability of their models. This is yet another source of information that must be retained for any LLM retraining process.

Data Sources

We mentioned earlier the first forms of data creation that emerged as computers became widespread. When IT systems were first developed, data was entered manually from existing systems and processes. Data entry clerks transcribed paper records into electronic data for further processing. With the widespread use of low-cost endpoint devices (starting with the personal computer), businesses were able to make data creation and data entry part of a typical job specification. Early desktop solutions included word processing, spreadsheets and rudimentary databases. Today, we would hardly consider adding computer skills to a job advertisement, as the ability to use a keyboard and mouse is seen as ubiquitous and essential in any office-based job.

From manually entered data typically stored in structured databases, the types and volume of business data have increased exponentially. Human-created data forms only a fraction of the data created within a commercial organisation every day. Much more information is created by customers using websites and applications, by manufacturing equipment and machinery creating products, and by sensors or other logging equipment, plus data created in the form of analytics or machine learning.

There is now also a significant volume of data used to track activity across information technology (both good and bad), with many multiples of application data retained for archiving and data recovery. Data protection processes and poor data management practices can result in data sprawl, with the same data duplicated many times over; some estimates put the level of redundant copies as high as 10x.

By 2025, IDC estimates that approximately 175 zettabytes of data will be created annually, with a compound annual growth rate (CAGR) of around 23%. However, only around 2% of data from 2020 was saved into 2021, demonstrating the transient nature of what we create. Of course, some data is filtered and consolidated, rather than being immediately discarded, a process which can’t easily be recreated if that information is lost.
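To make the effect of compound growth concrete, the short sketch below projects an annual data volume forward at a fixed rate. The 175 ZB and 23% inputs are the figures quoted above; holding the rate constant beyond 2025 is our own simplifying assumption for illustration, not an IDC forecast, and the function names are hypothetical.

```python
import math


def project_volume(start_zb: float, cagr: float, years: int) -> float:
    """Compound an annual data volume (in zettabytes) forward by `years`."""
    return start_zb * (1 + cagr) ** years


def years_to_double(cagr: float) -> float:
    """Years needed for the annual volume to double at a constant growth rate."""
    return math.log(2) / math.log(1 + cagr)


if __name__ == "__main__":
    # 175 ZB and 23% come from the text; a constant rate after 2025 is an
    # assumption made purely for illustration.
    for offset in range(6):
        volume = project_volume(175, 0.23, offset)
        print(2025 + offset, f"{volume:.0f} ZB")
    print(f"Doubling time: {years_to_double(0.23):.1f} years")
```

At a constant 23%, the volume created each year doubles roughly every three and a third years, which is why even a small retained fraction quickly becomes a large absolute quantity.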

Despite the attrition rate, there is constant pressure on IT organisations to store and manage ever greater volumes of data, with an increasingly unpredictable view on whether that information will ever be used again. Many businesses follow the mantra of “retain everything”, which has limits of sustainability and cost. Keeping all data forever also introduces risk for the business if processes aren’t sufficient to track and inventory content in a timely fashion.

Data Lifecycle

What is driving this uncertainty over future data value? For many decades, data had a predictable value over its lifetime. New records in a database, for example, would be frequently accessed when the data was created, then see a decrease in activity over time, before being archived and eventually deleted. This curve was well known and predictable.

Credit card processing is a good example to illustrate the data lifecycle. Historically, when a customer made a purchase, the transaction data would be frequently accessed by both the customer’s and the merchant’s banks. Assuming there were no queries with the transaction, the data would be archived after being used to produce a printed statement. Eventually the credit card company would delete the data, perhaps a fixed amount of time after an account was closed or based on whatever regulatory compliance rules required. The original transaction data would quickly lose value as it aged out.
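The traditional, predictable lifecycle described above (active, then archived, then deleted) can be sketched as a simple age-based policy. This is a minimal illustration only: the 90-day and seven-year thresholds are hypothetical values, not figures drawn from any regulation or real card scheme, and the names used are our own.

```python
from datetime import date
from enum import Enum


class Stage(Enum):
    ACTIVE = "active"      # frequently accessed, held on primary storage
    ARCHIVED = "archived"  # kept for reference, rarely accessed
    DELETED = "deleted"    # past its retention period


def lifecycle_stage(
    created: date,
    today: date,
    archive_after_days: int = 90,        # hypothetical threshold
    delete_after_days: int = 7 * 365,    # hypothetical retention period
) -> Stage:
    """Classify a record purely by its age, as in the traditional model."""
    age_days = (today - created).days
    if age_days < 0:
        raise ValueError("record cannot be created in the future")
    if age_days >= delete_after_days:
        return Stage.DELETED
    if age_days >= archive_after_days:
        return Stage.ARCHIVED
    return Stage.ACTIVE


if __name__ == "__main__":
    print(lifecycle_stage(date(2024, 1, 1), date(2024, 1, 31)))
```

A policy like this depends on age being a reliable proxy for value, which is exactly the assumption that modern analytics has weakened.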

Today, if a customer uses a credit card, the transaction might get reviewed through a mobile app or online. Banks generally enable customers to search their data a few years into the past. The bank and credit card provider may also use the data for real-time fraud detection, looking at past purchasing trends to determine whether to authorise or decline a transaction. Banks use historical customer data to determine credit limits, both upwards and downwards (many credit card providers will reduce a credit limit to mitigate losses if fraud occurs). Merchants will use credit card data to predict spending habits, for example, offering discounts or emailing promotions to prospective buyers.

One of the greatest challenges for businesses is to determine when and if data will be useful once past the point of initial creation. Where data used to decline in value over time, the increasing use of analytics means data can have some as-yet-unspecified future value, especially when combined with new data sources.

In the early days of computing, businesses could predict that the volume of data within an organisation was generally proportional to the volume of business being done. Today, data is retained on the assumption of future value and an as-yet-unmet need to keep an organisation competitive.

Changing Practices

Measured on a simple $/GB benchmark, storage infrastructure has never been so cheap. Barring one or two blips (due to manufacturing issues, for example the Thailand floods of 2011), storage costs have declined rapidly and consistently over time. The cost of a new hard disk drive model, for example, has been static at around $650 when first introduced to the market, irrespective of capacity, so each increase in capacity translates directly into a lower $/GB. This is because the bill of materials (BoM) hasn’t changed much in two decades.

Many businesses look at data storage and view the declining price of storage media as an opportunity to retain data forever. However, data management costs and data storage costs are not the same. In addition, the volume of data being stored in the enterprise is growing at a rate that is faster than the reduction in storage media costs when measured over the long term. Data management adds a premium on top of the cost of hardware acquisition. If data is retained on-premises, then there are also facilities costs such as data centre space, power and cooling. In the public cloud, storage is billed as a monthly recurring fee, with additional transactional charges for access. The cost of data retention can therefore be a capital cost, an operational cost or both, depending on where that data is located.
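The interaction between data growth and falling media prices can be shown with some simple arithmetic. The sketch below is a rough model under stated assumptions: spend on media alone is indexed to 1.0 in year zero, and each year it is multiplied by (1 + growth) × (1 − price decline). The 15% annual price decline used in the example is a hypothetical figure chosen for illustration, and the model ignores the management, facilities and access charges described above.

```python
def media_spend_index(growth: float, price_decline: float, years: int) -> list[float]:
    """Relative spend on storage media per year, indexed to 1.0 in year 0.

    growth        -- annual growth in stored capacity (0.23 means 23%)
    price_decline -- annual fall in $/GB (0.15 means 15%); hypothetical input
    """
    spend = 1.0
    index = [spend]
    for _ in range(years):
        # More capacity to buy, but each gigabyte costs less than last year.
        spend *= (1 + growth) * (1 - price_decline)
        index.append(spend)
    return index


if __name__ == "__main__":
    for year, value in enumerate(media_spend_index(0.23, 0.15, 5)):
        print(f"year {year}: {value:.2f}")
```

Whenever the product of the two factors is greater than one, total media spend rises year on year even though the unit price is falling, and that is before any management overhead is added.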

Retaining data forever becomes a significant form of technical debt if not managed correctly. Many organisations have no insight into what’s being stored and where (back to our four application deployment models). If a business doesn’t know what data assets it possesses, how can those assets be adequately protected?

Finally, we should point out the challenges of regulation. In many industries (notably finance and healthcare), extensive regulation exists to ensure data is retained safely for many years past the point of its original use. Businesses can be fined heavily for breaching the rules that govern data management. Outside of industry-specific regulation, GDPR and other country-based constraints place responsibility on businesses to maintain responsible stewardship of personal data, including timely reporting of data loss or data breaches.

In addition to rules on data management, governments are starting to introduce measures to ensure businesses can recover data in the event of a malware or ransomware attack. In January 2025, the Digital Operational Resilience Act (DORA) came into application in the EU, requiring financial institutions and their critical technology suppliers to improve their digital operational resilience. In 2023, the US government started development of the US Cyber Trust Mark, a labelling programme to ensure IoT devices conform to a strong cybersecurity framework.

Power and Responsibility

Data provides businesses with competitive advantage and, used correctly, has great power. However, at the same time, businesses have a responsibility to use data ethically and within the boundaries put in place by industry regulators. For both strategic and governance reasons, data must be protected from loss or corruption. We will touch on the regulatory issues of data management later in this book, but for now, we need to consider that data protection is an essential business function for all organisations.

The EU has introduced the EU AI Act, aimed at creating standards of governance for the use of data within AI-focused applications. The USA also introduced similar AI governance rules, but these were rescinded by incoming President Donald Trump during the first days of his second term in office.

Backup versus Archiving

We will come back to the topic in more detail later in this book, but for now we should explain the difference between backup and archiving. Here are two straightforward definitions.

Definition: Backup is the creation of a point-in-time copy of data that enables that data to be recovered, at some point in the future, to the state it was in when the copy was taken.

Definition: Archiving is the process of moving the primary copy of data out of storage systems to a secondary (typically cheaper) location, with a view to retaining the data for future reference but not using that data in day-to-day business activities.

As we can see, creating a backup results in an additional copy of data being taken, one that represents a static point in time we can return to at a later date. For various reasons, which we will discuss in a moment, it may be necessary to restore all or part of a backup to a previous point, due to issues with the main or current copy.

An archive represents a different process, where data is moved out of a primary application and placed elsewhere for future reference, generally on cheaper storage media. Archiving has several benefits. Firstly, it saves on primary disk space that may be an expensive resource. Secondly, it makes applications work faster by reducing the volume of data in search and retrieval functions. The savings made from archiving aren’t just in storage space but can result in lower memory requirements and reduced CPU utilisation. Thirdly, archive data can be stored in a format that provides for better processing, for example, in AI applications or other OLAP applications.
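The essential difference between the two definitions is copy versus move, and it can be expressed in a few lines of code. The sketch below is a toy model under stated assumptions: the primary system and the archive are plain in-memory dictionaries, and the function names are our own rather than those of any real backup product.

```python
import copy


def take_backup(primary: dict) -> dict:
    """Backup: create a point-in-time copy; the primary data is untouched."""
    return copy.deepcopy(primary)


def restore(primary: dict, backup: dict) -> None:
    """Return the primary data to the state captured in the backup."""
    primary.clear()
    primary.update(copy.deepcopy(backup))


def archive_records(primary: dict, archive: dict, keys: list) -> None:
    """Archive: move the primary copy of selected records elsewhere."""
    for key in keys:
        archive[key] = primary.pop(key)


if __name__ == "__main__":
    primary = {"order-1": "shipped", "order-2": "open"}
    archive: dict = {}

    backup = take_backup(primary)      # an extra copy now exists
    primary["order-2"] = "corrupted"   # something goes wrong
    restore(primary, backup)           # back to the earlier point in time

    archive_records(primary, archive, ["order-1"])
    print(primary)  # only order-2 remains in the primary system
    print(archive)  # order-1 now lives only in the archive
```

After a backup, the data exists in two places; after archiving, it still exists in only one place, which is why an archive generally needs protecting in its own right.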

Historically, backup has been used as a “poor quality” archive, with the assumption that data can be retrieved from a backup if needed. Data that would otherwise be archived is simply deleted from the primary system. This approach is risky, as the restore process can be protracted. In addition, backups end up being retained for much longer than is needed, simply to service the requirement for archived data recovery.

In modern IT infrastructure, archived data can easily be moved into a data warehouse or data lake, with specialist tools to provide data insights and analysis. We’ll touch on backup versus archiving in more detail later in this book.