One of the most important lessons I’ve learned as a data scientist is to acknowledge and communicate repeatedly that data has structure. We need to understand that structure, also known in math jargon as topology, in order to provide accurate, contextualized, and predictively powerful analysis.
To some this seems obvious; to others it sounds foreign. So what exactly do we mean?
Now, I don’t mean data structures in the sense of a technical tool like an array, list, or schema. These are data structures that analysts and programmers use to work data into a format capable of being understood by the computer. Important, but not what I mean here.
Nor do I mean the familiar dichotomy between unstructured and structured data.
What I mean is that the structure of the data is its shape: the composite of its features in a multidimensional representational space. This is what topology, or in our case topological data analysis, is concerned with.
In data science, we know this intuitively, but we rarely start from a solid conceptual foundation that data has structure beyond the technical schema. That’s a disservice to data analysts and scientists in all domains and can present a major mental barrier to iterative data discovery.
Understanding the structure, the topology, of the data means moving beyond the analysis of individual data points toward systems analysis, which separates signal from noise by showing where the interdependence between elements is significant. This is more commonly known as network analysis. When it can be properly mapped, topology is really the star of the show in data analysis, and it provides a strong foundation for predictive and simulation capabilities. Below are several domains where graph-based network analysis can yield incredible value through understanding the structure of the data.
Image taken from William Hamilton, COMP551: Graph Representation Learning
One tool that helps us understand structure better is the graph, which is what all the above examples have in common. I don’t mean line and pie charts. I mean graphs in the mathematical sense: a set of vertices connected by edges (or, if we want to be pedantic, arcs, if they have direction). In data science, we often call vertices nodes and edges relationships. Graphs such as these can be modeled in many different types of databases, chief among them native graph databases.
(above is a simple interactive graph of which Houses in Game of Thrones faced each other in battle)
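To make the vertex and edge vocabulary concrete, here is a minimal sketch in Python of the same idea, stored as an adjacency mapping. The house names and battle records are hypothetical placeholders, not the data behind the graph above, and the helper names (build_undirected, build_directed) are my own.

```python
# Hypothetical battle records as (attacker, defender) pairs.
# The names are placeholders, not real data.
battles = [
    ("House A", "House B"),
    ("House A", "House C"),
    ("House C", "House B"),
]


def build_undirected(edges):
    """Edges: 'these two houses faced each other', with no direction."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def build_directed(arcs):
    """Arcs: 'attacker -> defender', so direction matters."""
    adjacency = {}
    for source, target in arcs:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set())  # keep nodes with no outgoing arcs
    return adjacency


faced = build_undirected(battles)
attacked = build_directed(battles)

# Degree: how many distinct opponents each house faced.
degree = {node: len(neighbors) for node, neighbors in faced.items()}
```

The undirected version answers "who faced whom", while the directed version keeps track of who attacked. Even this tiny structure already supports a first network measure, the degree of each node.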
Graph database management systems (GDMS) are a paradigm for storing, retrieving, and analyzing data that describe entities and the relationships between them. They are based on the mathematical constructs of graph theory and are distinct from other database types in that they are built from the ground up to treat data about relationships between entities as being on par with the entities themselves. Robust logging of these relationships is essential to define structure. GDMS are best suited to handling highly connected, complex, and rich data, providing insight that can’t be (easily) visualized in other ways. Common use cases include enterprise knowledge graphs, social graphs, and anomaly detection models.
Common GDMS: Neo4j, JanusGraph, Dgraph, and Apache Giraph (strictly speaking a graph processing framework rather than a database). Multi-model databases that also run graph systems include Microsoft Azure Cosmos DB, OrientDB, and Amazon Neptune.
Most graph databases are composed of distinct Nodes, which most often represent the entities present, and Relationships, which represent a connection between two nodes. Relationships organize nodes into structures, allowing a graph to resemble a list, a tree, a map, or a compound entity, which in turn may be combined into yet more complex, richly interconnected structures. Think of nodes as fundamental building blocks, like elements, molecules, and cells. Relationships are the shape of how they are put together, and the emergent structures would be proteins, tissues, organs, etc. Defining the patterns and shapes of features that emerge within these structures constitutes much of the highest-value data science work. It is important to note that in a GDMS both nodes and relationships can also hold property metadata, as shown below. While there are several styles of graph databases, one of the most common (shown below) is called a labeled property graph.
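For readers who think in code, the labeled property graph idea can be sketched in a few lines of Python. This is only an illustration of the model, not how any particular graph database stores data; the class names (Node, Relationship, PropertyGraph) and the sample people and company are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set                      # e.g. {"Person"}
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int                       # id of the node the relationship leaves
    end: int                         # id of the node it points to
    rel_type: str                    # e.g. "KNOWS"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory labeled property graph (illustrative only)."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, labels, **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)

    def add_relationship(self, start, end, rel_type, **properties):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.relationships.append(Relationship(start, end, rel_type, properties))

    def outgoing(self, node_id, rel_type=None):
        """Relationships leaving node_id, optionally filtered by type."""
        return [
            r for r in self.relationships
            if r.start == node_id and (rel_type is None or r.rel_type == rel_type)
        ]


# Hypothetical sample data.
graph = PropertyGraph()
graph.add_node(1, ["Person"], name="Alice")
graph.add_node(2, ["Person"], name="Bob")
graph.add_node(3, ["Company"], name="Acme")
graph.add_relationship(1, 2, "KNOWS", since=2015)
graph.add_relationship(1, 3, "WORKS_AT", role="Analyst")
```

The point to notice is that the relationship is a first-class record with its own type and properties (since, role), rather than just a foreign key hidden inside one of the entities.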
Graph databases based on these kinds of structures allow queries to run recursively along all specified paths, providing comparatively quick performance given the complexity and size of the data. While relational database systems based on SQL can model graphs to a degree, a graph database offers an important efficiency gain, which may roughly be considered proportional to the number of joins the equivalent relational query would need and the technical overhead those joins incur. Remember, though, that not every problem is best suited to graphs.
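A rough way to see the join argument: asking "who is within k hops of this node?" against a relational edge table typically means joining that table to itself once per hop, whereas a traversal simply follows adjacency from node to node. The sketch below is a plain breadth-first search in Python over a hypothetical "follows" graph. It illustrates the traversal idea only and is not how any specific graph database executes queries; reachable_within is a name I made up.

```python
from collections import deque


def reachable_within(adjacency, start, max_hops):
    """Return {node: hop_count} for nodes reachable from start in <= max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distances:  # visit each node once, so cycles are safe
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # report only the other nodes
    return distances


# Hypothetical directed "follows" graph containing a cycle a -> b -> c -> d -> a.
follows = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}

two_hops = reachable_within(follows, "a", 2)
many_hops = reachable_within(follows, "a", 10)
```

Raising max_hops here changes one argument. In a relational model without recursive queries, each extra hop would typically mean one more self-join on the edge table.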
Graph databases have grown in popularity more than any other database type over the past ten years. The hype is strong, and there are many cases where using a graph database may not be the optimal choice, e.g., product inventory management. However, if you’re looking to break down silos and understand the value inherent in relationships among your data, whether that’s for your own enterprise or a customer, graph-based analysis is worth exploring.