Heterogeneous graph construction and HinSAGE learning from electronic medical records

An overview of the study methods is illustrated in Fig. 1. The datasets were mapped to the International Classification of Diseases, 10th Revision (ICD-10) codes of each subject and imported as comma-separated value (CSV) files. The extracted files were preprocessed in Python and then analyzed with Neo4j and the StellarGraph library. The graph network in Neo4j was built with the Cypher query language. We used StellarGraph [33] version 0.11.1 and Neo4j [34] version 4.2.5.

Figure 1

A pipeline for graph construction and graph neural network learning. The overall process is identical for both graphs up to the editing and manuscript step of the datasets; the two workflows diverge at the graph schema production step.

In this study, two graph models were built, each with a different structure and analytical purpose. The first graph was created in Neo4j to embed patient records from the EMR, so that the patient journey can be visualized efficiently and associated data points can be found with relatively simple queries. This property graph was built with semantic mappings on a shallow network embedding. The second graph was created with StellarGraph and was used for neural network prediction.

Data sources and the study cohort

The CardioNet database comprises EMR data from a total of 53,841 patients with various cardiovascular diseases (CVD) treated at Asan Medical Center in Seoul, South Korea. We selected patients who had previously been admitted to the cardiology department, had been diagnosed with angina (ICD-10 code: I20), and had never been diagnosed with myocardial infarction (MI), stroke, or heart failure. The cohort included inpatients and emergency-room encounters between January 1st, 2000, and December 31st, 2016; outpatients were excluded.
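The selection criteria above can be read as a single filter over encounter records. The following is a minimal sketch under stated assumptions: the record layout and the field names (`department`, `icd10`, `visit_type`, `admit_date`, `history`) are hypothetical and are not taken from CardioNet; only the criteria themselves come from the text.

```python
from datetime import date

# Study window and criteria as stated in the text; all field names are hypothetical.
STUDY_START = date(2000, 1, 1)
STUDY_END = date(2016, 12, 31)
EXCLUDED_HISTORY = {"MI", "stroke", "heart_failure"}
INCLUDED_VISIT_TYPES = {"inpatient", "emergency"}  # outpatients are excluded


def in_cohort(record):
    """Return True if one hypothetical encounter record meets every criterion."""
    return (
        record["department"] == "cardiology"
        and record["icd10"].startswith("I20")  # angina
        and record["visit_type"] in INCLUDED_VISIT_TYPES
        and STUDY_START <= record["admit_date"] <= STUDY_END
        and not (set(record["history"]) & EXCLUDED_HISTORY)
    )


# Toy records: only P1 satisfies all criteria.
base = {
    "patient_id": "P1",
    "department": "cardiology",
    "icd10": "I20.0",
    "visit_type": "inpatient",
    "admit_date": date(2005, 4, 2),
    "history": [],
}
records = [
    base,
    {**base, "patient_id": "P2", "history": ["MI"]},            # prior MI
    {**base, "patient_id": "P3", "visit_type": "outpatient"},   # outpatient
    {**base, "patient_id": "P4", "admit_date": date(2017, 1, 5)},  # outside window
]

cohort_ids = [r["patient_id"] for r in records if in_cohort(r)]
print(cohort_ids)
```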

Diagnosis, laboratory, echocardiography, physical, medication, surgery, visit and smoke datasets were extracted from CardioNet (Supplementary Fig. 1). The ICD-10 [35] was used to identify each patient’s health condition at admission. All patients were de-identified in accordance with the hospital’s privacy rules; each individual was given a unique patient ID, which acts as the key linking the datasets. Individuals admitted with angina were observed for five years after the event to examine whether they experienced further outcomes, including death, MI, stroke, or heart failure, as shown in Fig. 2. This study was approved by the Institutional Review Board of Asan Medical Center (No. 2021-0303), which waived the requirement for written informed consent. All experiments were performed in accordance with relevant guidelines and regulations.

Figure 2

Flow chart of the patients included in the study.

Graph construction with Neo4j

Graph schema construction

Several schemas were considered during the design phase, and their structures were refined by trial and error until the most comprehensive schema was produced. To evaluate the best-fit schema, we considered whether the graph could integrate the different types of attributes belonging to individual nodes and whether it could represent different node types. In addition, every node had to relate to the patient so that it could supply attributes of that patient’s medical information. The current model satisfies these criteria. During the design process, we examined the following graph schemas: linear and circular, directed and undirected, fully connected and not fully connected, bipartite and non-bipartite, attributed and non-attributed, and weighted and unweighted graphs (Supplementary Data 2). The finalized graph schema is illustrated in Fig. 3. Our graph model is patient-centric and multi-attributed. It is also multi-relational, with a heterogeneous set of edges that form interactions within the network, each with its own edge label. The graph was created as a bipartite graph and is therefore not fully connected. The relationships between the person node and the other nodes were defined so that a connection runs from the patient to every medical attribute that the patient received; our suggested bipartite schema is the optimal representation for retrieving heterogeneous electronic medical records. See Supplementary Data 2 for details.

Figure 3

A finalized version of the schema for the patient entity graph. The solid line represents the bipartite relationship between the person and other nodes. In contrast, the dotted line between the nodes represents the possible connection that could be made during the query based on the user’s inquiry on the graph database.

Entity and attribute selection

Ten types of entities were specified (Supplementary Data 3). All entity types contain two forms of ID: the unique patient ID, which was used to connect the entities, and the default internal ID generated by the Neo4j database. Supplementary Table 1 summarizes the nodes and node properties, along with the data types of the properties, for the graph model in Neo4j. The data types of the properties were chosen to maximize efficiency for visualization.


To connect the entities described in the previous section, a total of nine types of relationships were built. Each relationship was labeled according to the association between the entities it links and the entity at which the edge ends. The starting node and ending node of each link were named the head and tail entities, respectively. Every relationship begins at the person node and ends at one of the other node types (Supplementary Table 2).
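The head/tail convention can be written down as a list of (head, relationship, tail) triples. The sketch below is illustrative only: the relationship labels are hypothetical, and just a subset of the nine relationship types is shown, with tail names borrowed from the datasets listed earlier. The check confirms the patient-centric property, namely that every edge starts at the person node and ends elsewhere.

```python
# Hypothetical (head, relationship, tail) triples; labels are assumptions,
# not the names used in Supplementary Table 2.
SCHEMA = [
    ("Person", "HAS_DIAGNOSIS", "Diagnosis"),
    ("Person", "HAS_LAB_TEST", "Laboratory"),
    ("Person", "HAS_ECHO", "Echocardiography"),
    ("Person", "TOOK_MEDICATION", "Medication"),
    ("Person", "HAD_SURGERY", "Surgery"),
    ("Person", "HAD_VISIT", "Visit"),
]


def is_person_centric(schema, hub="Person"):
    """True if every relationship starts at the hub and ends at a non-hub node type."""
    return all(head == hub and tail != hub for head, _, tail in schema)


print(is_person_centric(SCHEMA))

# A direct Diagnosis -> Medication edge would break the bipartite, patient-centric layout.
broken = SCHEMA + [("Diagnosis", "TREATED_WITH", "Medication")]
print(is_person_centric(broken))
```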

Building a graph database using Cypher language

After the raw datasets were cleaned and the schematic representation was created, graph building began with writing the code to import and represent the EMR datasets. Constraints were applied to the node and edge types to assert that the patient ID property is unique within a node type. Indexes were then created on the relevant attributes to support look-up of nodes by label. Next, the nodes and their corresponding properties were imported from the datasets. Running the import transactions with the periodic commit command notably resolved the problem of insufficient memory.
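The import sequence described above (constraint, index, batched load) can be sketched as Cypher statements generated from Python. This is a hedged illustration, not the authors' script: the label `Person`, the properties `patient_id`, `sex`, and `birth_year`, and the file name `person.csv` are all hypothetical. The statements follow Neo4j 4.x syntax, matching the version reported in the text; Neo4j 5 changed the constraint syntax and removed `USING PERIODIC COMMIT`.

```python
def unique_constraint(label, prop):
    """Cypher (Neo4j 4.x syntax) asserting that `prop` is unique for nodes with `label`."""
    return f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{prop} IS UNIQUE"


def index_on(label, prop):
    """Cypher creating a look-up index on `prop` for nodes with `label`."""
    return f"CREATE INDEX FOR (n:{label}) ON (n.{prop})"


def load_nodes(label, csv_file, key, props, batch_size=1000):
    """Cypher that imports nodes from a CSV file, committing every `batch_size` rows."""
    assignments = ", ".join(f"n.{p} = row.{p}" for p in props)
    return (
        f"USING PERIODIC COMMIT {batch_size} "
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_file}' AS row "
        f"MERGE (n:{label} {{{key}: row.{key}}}) "
        f"SET {assignments}"
    )


# Hypothetical label, properties, and file name, used only for illustration.
statements = [
    unique_constraint("Person", "patient_id"),
    index_on("Person", "birth_year"),
    load_nodes("Person", "person.csv", "patient_id", ["sex", "birth_year"]),
]
for statement in statements:
    print(statement)
```

Committing in batches keeps each transaction small, which is why the periodic commit helps when a single large import would otherwise exhaust memory.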

Graph visualization

While constructing the graph database, several factors relating to visualization were considered. The main purpose of the graph modeling on EMR was to detail the patient’s journey through the graph in an effective but simple way, so the model patterns should provide insights quickly (please see Supplementary Fig. 2 for an example of a patient’s medical journey). Firstly, the colors and sizes of each entity were chosen independently so that each node type matched the corresponding edge types. Secondly, the thickness of the relationships connecting the data components was considered, because a powerful benefit of graphs is the ability to show the linkages between entities in the areas of interest. Lastly, the types of illustrated data attributes were carefully chosen to support the viewer’s intuitive visual understanding.

Application: graph neural network with Stellar graph

Graph schema construction

A heterogeneous, bipartite graph was constructed with both node and edge attributes. Two node types were therefore selected, one for each partition, and the remaining datasets were integrated as node attributes on each side. The outcome was included separately as an edge attribute, coded as a binary outcome column in the edge table. The outcome was coded as 1 if a patient admitted with angina subsequently experienced death, MI, stroke, or heart failure within five years. If the angina patient experienced none of death, MI, stroke, or heart failure during the five years of follow-up, the outcome was coded as 0. Each edge therefore carries 1 for a positive outcome and 0 otherwise. Although the two super nodes are named patient and diagnosis, the information they carry in the graph covers more than just the patient and their diagnosis.
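The binary outcome coding can be expressed as a small function. This is a sketch under assumptions: the event names, the `(name, date)` tuple layout, and the boundary convention (events strictly after the index admission and up to and including the fifth anniversary) are illustrative choices, since the text does not specify them.

```python
from datetime import date

FOLLOW_UP_YEARS = 5
OUTCOME_EVENTS = {"death", "MI", "stroke", "heart_failure"}


def add_years(day, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def outcome_label(index_date, events):
    """Return 1 if any outcome event falls within the follow-up window, else 0.

    `events` is an iterable of (event_name, event_date) tuples (hypothetical layout).
    """
    window_end = add_years(index_date, FOLLOW_UP_YEARS)
    return int(
        any(
            name in OUTCOME_EVENTS and index_date < when <= window_end
            for name, when in events
        )
    )


admission = date(2010, 3, 1)
print(outcome_label(admission, [("MI", date(2013, 6, 1))]))      # event inside the window
print(outcome_label(admission, [("stroke", date(2016, 1, 1))]))  # event after the window
print(outcome_label(admission, []))                              # no events
```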

Feature selection

The edge data frame for the StellarGraph embedding mechanisms was composed of three features: source, target, and outcome. The patient node data frame had 12 feature columns, while the diagnosis node data frame had 147 feature columns. The details are recorded in Supplementary Data 4. The data pre-processing step was identical to the pre-processing described in the "Graph construction with Neo4j" section.

Stellar graph construction

The index IDs, which are unique to each row of a node data frame, are used to connect the nodes in the graph to the edges. To prevent duplicate IDs from arising across node types, prefixes were added to all node indexes, and the prefixed IDs in the edge table were then matched against them. Two types of nodes, structured in data frames according to the previously listed features, were prepared. The edge data link the two node types and carry the relevant event outcome information. Overall, the graph model was built from the node data frames and the edge data frame.
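The prefixing step can be illustrated without any graph library. In the sketch below, plain dictionaries stand in for the node data frames, and the prefixes `p_` and `d_` and the feature names are hypothetical. The final check confirms that no ID is shared across node types and that every edge endpoint resolves to a node.

```python
def add_prefix(rows, prefix):
    """Return a copy of `rows` whose keys (node IDs) carry the given prefix."""
    return {f"{prefix}{key}": value for key, value in rows.items()}


# Raw IDs collide across node types: both tables contain "1" and "2".
patient_rows = {"1": {"age": 63}, "2": {"age": 71}}
diagnosis_rows = {"1": {"lab_value": 0.4}, "2": {"lab_value": 1.2}}

patients = add_prefix(patient_rows, "p_")      # hypothetical prefix
diagnoses = add_prefix(diagnosis_rows, "d_")   # hypothetical prefix

# Edge table with the three columns named in the text: source, target, outcome.
edges = [
    {"source": "p_1", "target": "d_1", "outcome": 1},
    {"source": "p_2", "target": "d_2", "outcome": 0},
]

shared_ids = set(patients) & set(diagnoses)
dangling = [
    e for e in edges
    if e["source"] not in patients or e["target"] not in diagnoses
]
print(shared_ids, dangling)
```

In StellarGraph, the equivalent inputs would be pandas data frames indexed by these prefixed IDs, together with an edge data frame containing source and target columns.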

Graph convolutional network: HinSAGE model

HinSAGE, a relatively new heterogeneous extension of the GraphSAGE [36] algorithm, enables supervised graph embedding that preserves not only the topology of a dynamic graph but also the attributes of its nodes and edges. GraphSAGE applies a generalized aggregation function inductively. Node and edge features also play a significant role in neural networks, since the features are the predicates of the subject of the study and thus should not be ignored.

The HinSAGE [37] mechanism builds the representation of a target node in two steps: first, it aggregates the feature representations of the neighboring nodes; second, it updates the embeddings that form the final output for the nodes or graph. We chose HinSAGE specifically because it is well suited to large, relationally heterogeneous datasets with multi-attributed node features. HinSAGE also operates more efficiently than other multi-attributed algorithms, particularly for outcome prediction when the outcome objective is attached to the edge attributes.
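The two-step aggregate-and-update idea can be sketched in plain Python. This is a simplified mean aggregator in the spirit of HinSAGE, not the StellarGraph implementation: the weight matrices are fixed toy values rather than learned parameters, the bias term is omitted, and the function name `hinsage_mean_layer` is hypothetical. Each neighbor type gets its own weight matrix, which is what allows node types with different feature widths (for example 12 versus 147 columns) to be combined.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, value) for value in vector]


def hinsage_mean_layer(h_self, neighbors_by_type, w_self, w_neigh):
    """One simplified layer: aggregate neighbors per type, then update the target node."""
    self_part = matvec(w_self, h_self)
    # Step 1: aggregate neighbor features, using a separate weight matrix per node type.
    type_parts = [
        matvec(w_neigh[node_type], mean_vector(features))
        for node_type, features in neighbors_by_type.items()
        if features
    ]
    neigh_part = mean_vector(type_parts) if type_parts else [0.0] * len(self_part)
    # Step 2: update by concatenating the self and neighbor parts and applying ReLU.
    return relu(self_part + neigh_part)


# Toy example: a patient node with 2 features and two diagnosis neighbors with 3 features.
patient = [1.0, 2.0]
neighbors = {"diagnosis": [[1.0, 0.0, 1.0], [3.0, 0.0, -1.0]]}
w_self = [[1.0, 0.0], [0.0, 1.0]]
w_neigh = {"diagnosis": [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]}

embedding = hinsage_mean_layer(patient, neighbors, w_self, w_neigh)
print(embedding)
```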

In our development process, a graph object was first created as the graph topology input, with the two node data frames and the one edge data frame embedded together as a heterogeneous graph. The HinSAGE algorithm sampled the node neighbors in the graph structure. After the graph was created, the edges were split into train and test sets, with the source, target, and label columns handled separately. Generators were then created with the specified node types and batch sizes to map the samples. The two-layer HinSAGE model was built next, exposing its input and output tensors, and an estimator layer was added on top. Lastly, the Keras model was built for prediction and compiled with a custom optimizer, a loss function to minimize when fitting the model, and metrics for evaluation (please see Supplementary Data 5 for full information on the model parameters in the experiment setting).
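The split of labeled edges into train and test sets, with the source/target pairs kept apart from the labels, can be sketched as follows. This is a minimal illustration under assumptions: the 30% test fraction, the seed, and the function names are hypothetical (the actual settings are in Supplementary Data 5), and no stratification by outcome is applied.

```python
import random


def split_edges(edges, test_fraction=0.3, seed=42):
    """Shuffle edges reproducibly and return (train_edges, test_edges)."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def unpack(edges):
    """Separate (source, target) pairs from their outcome labels."""
    pairs = [(edge["source"], edge["target"]) for edge in edges]
    labels = [edge["outcome"] for edge in edges]
    return pairs, labels


# Toy edge table with the three columns named in the text.
edges = [
    {"source": f"p_{i}", "target": f"d_{i % 3}", "outcome": i % 2}
    for i in range(10)
]

train_edges, test_edges = split_edges(edges)
train_pairs, train_labels = unpack(train_edges)
test_pairs, test_labels = unpack(test_edges)
print(len(train_pairs), len(test_pairs))
```

Pairs and labels in this shape correspond to what a link-level generator consumes: a list of (source, target) node ID pairs alongside one label per pair.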