Healthcare organizations are under increasing pressure to use their resources more efficiently while guaranteeing high-quality patient care (Foshay and Kuziemsky, 2014; Hanson, 2011). To meet these needs, various researchers and practitioners recommend that healthcare organizations apply big data analytics to ensure effective clinical and administrative decision making (Belle et al., 2015; Sen et al., 2012), improved patient care (Wang and Hajli, 2017; Foshay and Kuziemsky, 2014; Tremblay et al., 2012), and clinical cost reduction (Bates et al., 2014; Pine et al., 2012). In recent years, the healthcare industry has experienced exponential growth in health data (Fang et al., 2016). In the U.S. healthcare system alone, health data has reached the scale of zettabytes (10^21 bytes), of which 80% is unstructured (Wang et al., 2018). Examples of unstructured health data are physicians’ clinical notes and medical images. In 2015 alone, 60 billion images were generated in the U.S. that could be used for more accurate medical diagnosis and improved patient care (IBM Watson Health, 2016).
This paper provides an overview of the rapidly growing field of big data analytics in healthcare in general and the analysis of medical images in particular. Despite the increasing volume and complexity of medical image data, the development of large-scale analytics that bridge the gap between images and diagnoses in near real-time has not kept pace with the needs of computer-aided diagnosis (CAD) and decision support systems in modern healthcare organizations (Zhang and Metaxas, 2016). The goal of this paper is to shed light on innovative large-scale data science techniques in medical image analytics that might benefit clinical decision making, medical diagnosis, and disease exploration. Hence, this paper seeks to answer the following research question: How might big data (medical image) analytics support healthcare organizations in clinical diagnosis?
To answer this research question, the paper proceeds as follows: First, an overview of big data analytics in healthcare is provided with a focus on medical image analytics. Second, two large-scale image analysis case studies are presented to ground the theory, on the basis of which an integrated framework is proposed that shows how big data analytics might assist medical diagnosis. Third, contemporary challenges of IT adoption in healthcare are discussed. Lastly, a brief conclusion is drawn.
2 Theoretical Background
2.1 Big Data Analytics in Healthcare
Research defines the concept of big data by the three V’s of data: volume, velocity, and variety (Phillips-Wren et al., 2015; Gandomi and Haider, 2015; Chen et al., 2012). In the healthcare context, data is spread among multiple entities, such as hospitals, health systems, researchers, and governments, and is stored in silos that lack global transparency and access (Belle et al., 2015). Within this realm, various researchers see big data analytics as an enabler to access, store, analyze, and visualize large amounts of data in an integrated, interoperable, and real-time manner that will eventually support many healthcare organizations in making decisions and taking action (Watson, 2014; Raghupathi and Raghupathi, 2014). Three types of data sources can be found in hospitals: (1) clinical data sources like electronic health records (EHR), laboratory results, and medical images, (2) administrative data sources like personnel and financial data, and (3) external data sources like statistical or social media data (Mettler and Vimarlund, 2009). To leverage big data analytics and capitalize on this immense volume of health data, healthcare organizations must implement modern big data solutions for data storage, analysis, and visualization.
Raghupathi and Raghupathi (2014) have proposed an architectural framework for big data analytics in healthcare that highlights different technologies (e.g., Hadoop, Pig, and Hive). Wang et al. (2018) present a similar framework that adds data governance across the data capturing, transformation, and consumption layers. The framework of Mettler and Vimarlund (2009) integrates healthcare stakeholders and processes within the technology realm. They argue that big data analytics in healthcare should support management in understanding available internal and external capabilities and facilitate clinical and administrative decision making by integrating all kinds of metrics about a variety of actors. As this paper focuses on leveraging big data analytics to benefit medical diagnosis, Figure 1 illustrates an integrated framework based on a comprehensive literature review. This framework combines the core approaches of the above-mentioned researchers and allows for a structured analysis based on the common interplay of people, processes, and technology in an organization.
Figure 1. Integrated Framework for Big Data Analytics in Healthcare. Own representation based on Wang et al. (2018), Raghupathi and Raghupathi (2014), Mettler and Vimarlund (2009).
However, implementing this integrated framework involves various challenges, such as limited data sharing due to data privacy regulations; hindered data integration due to required standards, vendor lock-in, and data format variety; the difficulty of analyzing large-scale information in a timely and accurate manner; and weak data visualization solutions (Wang and Hajli, 2017; Belle et al., 2015; Krumholz, 2014).
2.2 Image Analytics in Healthcare
According to Siuly and Zhang (2016), images are an important source for medical diagnosis, therapy assessment, and planning. Well-known imaging techniques are computed tomography (CT), X-ray, magnetic resonance imaging (MRI), and mammography. Generated images are shared using standard protocols like Digital Imaging and Communications in Medicine (DICOM) and stored in picture archiving and communication systems (PACS) (Luo et al., 2016). The data size of medical images can range from a few megabytes to hundreds of megabytes per study (Belle et al., 2015). The growing number of medical images produced on a daily basis in modern hospitals requires a shift from traditional medical image analysis towards highly scalable solutions that offer opportunities for greater use of computer-aided diagnosis (CAD) and decision support systems (Wang et al., 2018; Markonis et al., 2015). The volume, velocity, and variety of medical image data require large data storage capacity as well as fast and accurate algorithms. Many prior studies have tested different methods for image analytics in healthcare (Siuly and Zhang, 2016). In addition to classical machine learning methods for data mining, such as supervised learning (e.g., classification) and unsupervised learning (e.g., clustering), more advanced methods like support vector machines (SVM), neural networks, and other artificial intelligence (AI) techniques are often applied in this realm (Belle et al., 2015; Markonis et al., 2015; Dilsizian and Siegel, 2014). For example, classification consists of assigning a label (e.g., healthy or diseased) to a given image, and segmentation of assigning labels to regions within it; in both cases, the image is represented in a feature space that describes it (e.g., color and texture). After having been trained on labeled images, supervised machine learning algorithms are used to predict the classes of test images based on their visual features (Markonis et al., 2015).
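The supervised classification workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the cited studies: the two features (mean intensity and a simple texture proxy), the nearest-centroid classifier, the function names, and the tiny synthetic "images" are all hypothetical choices made for readability. Real CAD systems rely on far richer features and models.

```python
from statistics import mean


def extract_features(image):
    """Map a 2D grayscale image (list of pixel rows) to a small feature vector.

    Hypothetical features: mean intensity and a crude texture proxy
    (average absolute difference between horizontally adjacent pixels).
    """
    pixels = [p for row in image for p in row]
    diffs = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    texture = mean(diffs) if diffs else 0.0
    return (mean(pixels), texture)


def train_centroids(images, labels):
    """Training step: average the feature vectors of each labeled class."""
    grouped = {}
    for image, label in zip(images, labels):
        grouped.setdefault(label, []).append(extract_features(image))
    return {
        label: tuple(mean(dim) for dim in zip(*feats))
        for label, feats in grouped.items()
    }


def predict(centroids, image):
    """Prediction step: return the label whose centroid is closest in feature space."""
    feats = extract_features(image)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Synthetic training data (assumption): smooth images labeled "healthy",
# high-contrast images labeled "diseased".
train_images = [
    [[10, 10, 10], [10, 10, 10]],
    [[12, 12, 12], [12, 12, 12]],
    [[0, 200, 0], [200, 0, 200]],
    [[10, 190, 10], [190, 10, 190]],
]
train_labels = ["healthy", "healthy", "diseased", "diseased"]

centroids = train_centroids(train_images, train_labels)
test_image = [[5, 180, 5], [180, 5, 180]]
prediction = predict(centroids, test_image)
print(prediction)
```

The split into a feature extraction step, a training step, and a prediction step mirrors the description above; replacing the nearest-centroid rule with an SVM or a neural network would change only the last two functions.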
To enable these methods, recent studies propose solutions for large-scale medical image analysis based on parallel computing and algorithm optimization (Wang et al., 2018; Luo et al., 2016). Apache Hadoop is an open source framework that allows for the distributed processing of large datasets across computer clusters using simple but powerful programming models such as MapReduce, and it is often combined with processing engines such as Apache Spark (Belle et al., 2015). These and various other techniques can be used to build big data analytics solutions for medical images (Siuly and Zhang, 2016).
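To make the MapReduce programming model more concrete, the following sketch imitates its three phases (map, shuffle, reduce) in plain, single-process Python. It does not use the Hadoop or Spark APIs; the record layout, the function names, and the study sizes are assumptions chosen purely for illustration. On a real cluster, the map and reduce functions would run in parallel on different nodes, and the framework would handle the shuffle.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair per imaging study, keyed by modality."""
    yield record["modality"], (record["size_mb"], 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate the values of one key into a per-modality summary."""
    total_mb = sum(size for size, _ in values)
    studies = sum(count for _, count in values)
    return key, {"studies": studies, "total_mb": total_mb}


def run_job(records):
    """Run map, shuffle, and reduce sequentially in a single process."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical imaging studies; modalities and sizes are made-up values.
studies = [
    {"modality": "CT", "size_mb": 300},
    {"modality": "MRI", "size_mb": 150},
    {"modality": "CT", "size_mb": 250},
    {"modality": "X-ray", "size_mb": 20},
]

summary = run_job(studies)
print(summary)
```

Because each map call depends only on its own record and each reduce call only on the values of one key, the same job can be spread across many machines without changing its logic, which is what makes the model attractive for large image archives.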