In 2012, MIT Sloan published an article explaining the “Big Data Revolution.” The authors emphasized the necessity of utilizing the tremendous amounts of data that are being generated through digitization and the internet of things (IoT). Fast-forward to today. A few short months ago, we penned a paper entitled “Screwed: The Real Value of Data.” In it, we argued that it was no longer a question of whether your organization can transform itself into one that leverages data to build its digital foundation for the future.
Instead of just leaving you there with this platitude of data being the “digital screws” of the future, we at INFORM have decided to put together a three-part series on effective data strategy. It is meant to have enough substance to help readers understand the foundation of data strategy and start down the road to establishing one. In our series, we will give you an introduction to what cutting-edge technology enables us to do, explain what the implications for maritime operations are, and, most importantly, give you an idea of how to access these benefits. In Part 1, which we are launching in Port Technology International’s landmark 100th Edition of the Journal, we kick it off by having a look at data management.
WHAT IS DATA MANAGEMENT IN LAYMAN’S TERMS?
Data management is a term that can be interpreted in various ways. Its definition ranges from technical topics such as data sources, storage, connectivity, transfer, transformation, and modeling to less technical ones like governance, security, or cataloguing. So, what is it? Well – all of the above. Data management describes the collection, refinement, and provisioning of data. It includes everything that has to happen between the creation of a data point and that data being made available in an appropriate form for consumption by data analytics, data science, artificial intelligence, operations research, and other advanced computing practices.
WHY IS DATA INTEGRATION RELEVANT?
Spoiler alert: Speed is what is important here.
One key component of data management is data integration. The value of datasets is vastly increased if they are enriched with context, especially across system boundaries. If we can, for example, integrate the information of inbound shipments with metadata from the port of origin, shipping line, container master data, handling equipment parameters, and other sources, we can start to form the digital twin that is quickly becoming a focal point of many ports and terminals around the world.
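As a minimal sketch of what this kind of enrichment can look like, the Python below merges an inbound shipment record with container master data and shipping line metadata. All field names, identifiers, and values are hypothetical and chosen purely for illustration; they are not taken from any real TOS or port system.

```python
# Illustrative only: all records, field names, and values are hypothetical.

# Data from three separate (imaginary) systems, keyed by their own identifiers.
inbound_shipments = [
    {"shipment_id": "S-001", "container_id": "C-100", "line_code": "LINE-A", "origin_port": "PORT-X"},
    {"shipment_id": "S-002", "container_id": "C-200", "line_code": "LINE-B", "origin_port": "PORT-Y"},
]
container_master = {
    "C-100": {"size_ft": 40, "type": "reefer"},
    "C-200": {"size_ft": 20, "type": "dry"},
}
shipping_lines = {
    "LINE-A": {"line_name": "Example Line A"},
}


def enrich(shipment, containers, lines):
    """Return a copy of the shipment with context from other systems merged in.

    Missing lookups are recorded rather than silently dropped, so data
    quality gaps across system boundaries stay visible.
    """
    enriched = dict(shipment)
    missing = []
    for key, source, name in (
        ("container_id", containers, "container_master"),
        ("line_code", lines, "shipping_lines"),
    ):
        context = source.get(shipment[key])
        if context is None:
            missing.append(name)
        else:
            enriched.update(context)
    enriched["missing_context"] = missing
    return enriched


integrated = [enrich(s, container_master, shipping_lines) for s in inbound_shipments]
```

Each integrated record now carries its context with it, and the `missing_context` field flags shipments whose metadata could not be found in one of the other systems.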
The availability of associated information with regard to every part of port and terminal operations would greatly increase transparency and control, and provide the best foundation for making insight-based, split-second decisions. Speed is crucial. Retrospective insights can be helpful in refining processes, but they don’t help you identify problems in real time and certainly don’t give you data-based options for resolving the issues.
MODERN TOOLS MAKE DATA INTEGRATION STRAIGHTFORWARD
Up until recently, the entry cost and effort to implement solutions for Extract, Transform, and Load (ETL), as well as storage (data warehouses especially), was a significant deterrent to implementing data integration as part of your data strategy. Add to this the lack of qualified personnel, and the challenges typically outweighed the ROI potential.
However, since 2012, many things have changed. Big data has just become data. No one really bats an eye at millions of records anymore. Machine learning is ubiquitous in our everyday life, be it in navigation, shopping, meal recommendations, or smart assistants. Storage and computing power have become highly accessible and extremely affordable through the spread of cloud-based business and service models. Previously, the scope of data-driven projects was limited by the horsepower available in one’s on-premises servers. Nowadays, fully scalable resources are available through cloud providers like Amazon, Google, and Microsoft – to name just a few of the prominent players.
This has seen the cost of storing a terabyte of data in a cloud data warehouse drop to as low as $23 (€20) per month. To put this into perspective, a consumer solid-state disk is five times the cost and does not come with built-in enterprise-level security. The same goes for the ability to run analytics queries on the data. Cloud computing power is scaled to facilitate whatever complex calculation is thrown at it and charged per minute of usage. Gone are the days of paying for dormant CPUs that only spin up occasionally.
Another major development is the emergence of capable ETL tools that, most often, do not only move data from the source to the centralized data storage (be it data lake or warehouse – more on that below), but will also assess data quality (at a rudimentary level), create data models, and, in some cases, will automatically create data marts for immediate consumption by data analytics solutions. Every process along the value-added data chain, where data gets handled, transferred, or transformed, is also often referred to as data in motion. Capable contenders include Qlik Data Integration and TimeXtender as well as proprietary data pipelines like Snowpipe (Snowflake) or Microsoft Azure Data Factory (MS Azure). Other tools come with built-in data catalogues that allow business users to simply shop for data necessary to tackle the business challenges before them.
This allows companies to approach data management in a more flexible and versatile fashion. In traditional systems, the design of the solution determines the necessary data model. Based on the data model, lengthy data architecture projects are necessary to facilitate data analytics projects. If, at a later stage, additional fields or transformations are necessary, these changes can only be embedded after days, weeks, or even months of modeling. This greatly delays the benefits generated by the insights coming from that data, often to the point of redundancy.
ETL tools and their more modern counterpart, Extract, Load, Transform (ELT) tools (i.e., you move the data, store it, and then transform it according to purpose), combined with data warehouse or data lake automation, can reduce the time, effort, and human resources required to react to new developments and requirements in the rapidly evolving context of data analytics and data science by up to a factor of 10.
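To make the load-first, transform-later idea concrete, here is a small sketch in Python. It uses an in-memory SQLite database purely as a stand-in for a cloud data warehouse; the table names, columns, and equipment move records are hypothetical and not taken from any of the tools named above.

```python
import sqlite3

# Illustrative only: SQLite stands in for a cloud data warehouse, and the
# table names, columns, and values are hypothetical.
raw_moves = [
    ("C-100", "2024-01-01T08:00", "2024-01-01T08:12"),
    ("C-200", "2024-01-01T09:00", "2024-01-01T09:30"),
    ("C-300", "2024-01-01T10:00", None),  # move not yet completed
]

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source records unchanged in a raw table.
conn.execute("CREATE TABLE raw_moves (container_id TEXT, started TEXT, finished TEXT)")
conn.executemany("INSERT INTO raw_moves VALUES (?, ?, ?)", raw_moves)

# Transform: derive a purpose-specific table inside the "warehouse".
# A new requirement later only needs another query over raw_moves,
# not a change to the loading step.
conn.execute(
    """
    CREATE TABLE move_durations AS
    SELECT container_id,
           CAST(ROUND((julianday(finished) - julianday(started)) * 24 * 60) AS INTEGER)
               AS duration_min
    FROM raw_moves
    WHERE finished IS NOT NULL
    """
)

durations = conn.execute(
    "SELECT container_id, duration_min FROM move_durations ORDER BY container_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_moves").fetchone()[0]
```

Because the raw table is kept intact, the incomplete move is still available for other purposes even though this particular transformation filters it out.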
THE TWO MAJOR WAYS TO APPROACH DATA STORAGE
Data in motion (e.g., pipelines, queries, ETL, etc.) is the hot topic, but it is important to touch on data at rest – storage. Multiple paradigms exist today, but the two that are primarily considered when speaking about enterprise data platforms are data warehouses and data lakes.
Data warehouses fell out of favor in the 2010s because of their high costs, maintenance, and effort, especially during modeling, as described above. In their wake, data lakes began their rise in popularity. Data lakes are mostly used in conjunction with the term Big Data. Big Data used to pose a challenge because traditional data storage technology was not designed to ingest and store the enormous amounts of data generated by data streams originating from the internet, IoT, and embedded sensors in smart technology. Today, applications that generate data in the terabyte region would just be called data – omitting the Big since technology has evolved sufficiently to handle the volume and velocity challenges and is deeply embedded in our everyday lives, as previously discussed.
The reason Big Data has just become data is closely tied to the emergence of distributed storage – an alternative term for data lakes. Data lakes are designed to rapidly store humongous amounts of data. This makes them incredibly valuable as a device to create a lossless record of any type of data you throw at them. This design principle is also their biggest weakness. They are based on a Hadoop Distributed File System (HDFS). While this file system is incredibly efficient at storing data, it makes retrieving that data a challenge.
Query accelerators, data lake engines, and other tools like Spark, Impala, or Dremio lower the access barrier to the data by providing an interface generally similar to SQL queries. Above, we described that the value in data lies in the ability to amplify decisions and processes in a highly dynamic and ever-evolving world. Decision-makers should generally not have to bother with a cumbersome, often quite technical, query infrastructure to obtain business-relevant data. This is where the renaissance of the business data warehouse begins.
Modern data warehouses are not the monstrous, complicated, maintenance-heavy, and privacy-deficient structures of old. Modern data warehouses are built on the cloud technology discussed above. Snowflake, Azure Synapse, Google BigQuery, or Amazon Redshift, again, to name just a few, are built with a cloud-native mindset. Pay as you go, pay for what you use, and keep the data close to the business. Today, data lakes offer unstructured – close to raw – data for engineers and data scientists. Data warehouses offer structured, cleaned, refined – close to business – data to self-service business intelligence users, analysts, and, crucially for the port and terminal industry, decision-makers.
LAYING YOUR DATA STRATEGY FOUNDATION
Which is the right form of storage for your business? The answer is not as straightforward as it used to be. While unstructured and raw data are easily stored in data lakes, business requirements are often better satisfied by using a data warehouse. This often leads INFORM to recommend a hybrid approach. Keep unstructured data coming in from vessels, handling equipment, weather feeds, and so on in a data lake for future use and data science projects. Business data originating from TOS, PCS, CRM, and other transactional systems is best refined and stored in data warehouses for rapid access and consumption by decision-makers on any hierarchical and, importantly, technical level.
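The hybrid approach can be pictured as a simple routing rule. The Python sketch below is only an illustration under stated assumptions: the source names and the two storage labels are hypothetical, and a real data platform would base this decision on far more than the name of the source system.

```python
# Illustrative only: source names and storage labels are hypothetical and
# simply mirror the examples given in the text.
STRUCTURED_BUSINESS_SOURCES = {"TOS", "PCS", "CRM"}


def storage_target(source_system: str) -> str:
    """Pick a storage layer for a data source under the hybrid approach.

    Transactional business systems go to the data warehouse; everything
    else (sensor feeds, telemetry, weather data) is kept raw in the lake.
    """
    if source_system.upper() in STRUCTURED_BUSINESS_SOURCES:
        return "data_warehouse"
    return "data_lake"


sources = ["TOS", "vessel_telemetry", "weather_feed", "crm", "PCS"]
routing = {name: storage_target(name) for name in sources}
```

Defaulting unknown sources to the data lake reflects the idea of keeping a lossless raw record first and refining it later, once a business need for the data is clear.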
HOW CAN INFORM HELP?
At INFORM, we are always looking to the future to understand what products and solutions we need to be developing and positioning in the port and terminals industry. Building on our decades of experience and rich knowledge base that spans our 800+ strong company, INFORM has been quietly working on our data strategy offering to enrich our customers’ data sets, which in turn enriches our machine learning-, AI-, and operations-research-based algorithms – all of which depend on good quality and, as we learned here, timely access to data. For companies leveraging both the expertise of INFORM’s DataLab and our team’s rich industry experience, our data strategy services are unmatched in the industry. Reach out today for an obligation-free conversation.