Big Data for Healthcare
taught by Jimeng Sun
In CSE 6250: Big Data Healthcare, I learnt the following key concepts and topics:
Medical Data Characteristics and Data Mining Challenges:
Overview of Big Data: Understanding the definition, characteristics, and significance of big data in healthcare.
Healthcare Data Sources: Familiarity with various sources of healthcare data, including electronic health records (EHRs), medical imaging, genomics, and wearable devices.
Complexity and heterogeneity: Understanding the complexity and heterogeneity of medical data.
Addressing the challenges in analyzing large and complex healthcare datasets.
Predictive modeling in healthcare contexts.
Computational phenotyping to identify patterns in medical data.
Patient similarity analysis for personalized healthcare.
Data Management and Processing
Data Storage Solutions: Exploring storage solutions for large-scale healthcare data, such as Hadoop, NoSQL databases, and data lakes.
Data Integration: Techniques for integrating diverse healthcare data sources to create unified datasets for analysis.
Data Cleaning and Preprocessing
Data Quality Issues: Identifying and addressing data quality issues, such as missing data, inconsistencies, and errors.
Preprocessing Techniques: Applying preprocessing techniques to clean and prepare healthcare data for analysis.
Big Data Analytics
Descriptive Analytics: Using big data tools to summarize and describe healthcare data.
Predictive Analytics: Applying machine learning algorithms to predict healthcare outcomes, such as disease progression and patient readmissions.
Prescriptive Analytics: Developing models to recommend actions for improving healthcare outcomes.
Machine Learning and Artificial Intelligence
Supervised Learning: Techniques such as regression, classification, and ensemble methods.
Unsupervised Learning: Clustering, dimensionality reduction, and anomaly detection.
Deep Learning: Neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs) for healthcare applications.
Healthcare-Specific Applications
Genomic Data Analysis: Techniques for analyzing and interpreting genomic data.
Medical Imaging: Applying big data techniques to analyze medical images for diagnosis and treatment planning.
Natural Language Processing (NLP): Using NLP to extract information from clinical notes and other unstructured text data.
Privacy and Security
HIPAA Compliance: Understanding the legal and regulatory requirements for protecting patient data.
Data Security Techniques: Implementing security measures to protect healthcare data from breaches and unauthorized access.
Big Data Analytic Systems:
Utilization of Hadoop family tools like Hive, Pig, HBase.
Implementation of Spark for big data processing.
Use of Graph Databases for complex data relationships.
Using tools like Tableau, D3.js, and Power BI to create meaningful visualizations of healthcare data.
Pre-Requisite Knowledge and Skills:
Machine learning and data mining concepts such as classification and clustering.
Proficiency in programming languages like Scala, Python, and Java.
Experience with data handling and ETL processes, including SQL and NoSQL databases.
Ethical Considerations
Ethical Issues: Understanding the ethical implications of big data analytics in healthcare, including patient consent and data bias.
Responsible Use: Strategies for ensuring the responsible use of big data in healthcare to improve patient outcomes without compromising ethical standards.
Practical Applications:
Applying big data techniques to real-world healthcare problems.
Working on projects and assignments that simulate real-world data analytics scenarios in healthcare.
The course also emphasized the importance of teamwork, participation, and adherence to academic integrity standards.
Through assignments and project work I gained practical knowledge and experience in the following areas:
Medical Data Analysis:
Cleaning, preprocessing, and analyzing large, heterogeneous datasets from healthcare organizations.
Addressing practical challenges in analyzing medical data to derive actionable insights.
Predictive Modeling in Healthcare:
Building and evaluating predictive models to forecast healthcare outcomes.
Applying machine learning techniques to healthcare datasets for predictive analytics.
Computational Phenotyping:
Identifying patterns and phenotypes in medical data.
Using computational techniques to classify and group patient data based on medical characteristics.
Patient Similarity Analysis:
Implementing algorithms to measure patient similarity.
Using similarity analysis to support personalized medicine and treatment plans.
Scalable Machine Learning Algorithms:
Implementing online learning algorithms that can handle streaming data.
Conducting fast similarity searches on large datasets.
Big Data Analytic Systems:
Using Hadoop family tools (Hive, Pig, HBase) for distributed data storage and processing.
Implementing data processing workflows using Spark for large-scale data analytics.
Employing Graph Databases to analyze complex relationships in healthcare data.
Programming and System Skills:
Developing proficiency in programming languages such as Scala, Python, and Java.
Handling data using SQL and NoSQL databases like MongoDB.
Performing Extract, Transform, Load (ETL) processes on large datasets.
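As an illustration of the patient similarity item above, here is a minimal sketch that ranks patients by cosine similarity over sparse feature vectors. The feature names (`dx_diabetes`, `rx_metformin`, and so on), the patient IDs, and the choice of cosine similarity are assumptions made for the example; the course material may have used other representations or metrics.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse feature dicts {feature: value}."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, patients, k=2):
    """Return the IDs of the k patients most similar to the query vector."""
    scored = [(cosine_similarity(query, vec), pid) for pid, vec in patients.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]


# Hypothetical toy cohort: each patient is a bag of binary clinical features.
patients = {
    "p1": {"dx_diabetes": 1.0, "rx_metformin": 1.0},
    "p2": {"dx_asthma": 1.0},
    "p3": {"dx_diabetes": 1.0},
}
query = {"dx_diabetes": 1.0, "rx_metformin": 1.0, "lab_hba1c": 1.0}
neighbors = most_similar(query, patients, k=2)
```

A brute-force scan like this is fine for small cohorts; the fast similarity search mentioned above would replace the linear scan with an index or hashing scheme.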
These concepts and experiences gave me a comprehensive understanding of how to leverage big data in healthcare to improve patient outcomes, enhance healthcare delivery, and support decision-making. By covering practical aspects, the course equipped me with the skills needed to apply big data analytics in healthcare settings and prepared me for real-world challenges in the industry.
Summary of Assignments and Group Project:
Assignment 1:
Solution and submission available on request
Objective and Quick Summary:
Objective:
The objective of this assignment is to apply data science techniques to healthcare data, specifically focusing on mortality prediction using clinical data. The assignment involves tasks such as data preparation, descriptive statistics, feature construction, and predictive modeling.
Quick Summary of the Assignment:
CITI Certification:
Complete the CITI training to work with the MIMIC database.
Descriptive Statistics:
Calculate basic statistics on the provided clinical data (e.g., event count, encounter count, and record length) for deceased and living patients.
Feature Construction:
Implement functions to process the clinical data, including computing index dates, filtering events, aggregating events into features, and saving the data in SVMLight format.
Predictive Modeling:
Build and evaluate predictive models (Logistic Regression, SVM, Decision Tree) using the constructed features, and report performance metrics (Accuracy, AUC, Precision, Recall, F-Score) on both training and test datasets.
Model Validation:
Apply validation techniques to ensure the robustness of the predictive models.
The assignment is designed to reinforce practical skills in handling healthcare data, conducting feature engineering, and building machine learning models for predictive tasks.
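The feature construction steps described above (filtering events against an index date, aggregating them into features, and writing SVMLight lines) can be sketched roughly as follows. This is not the assignment's skeleton code: the event tuple layout, the 2000-day observation window, the count-based aggregation, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def filter_events(events, index_dates, window_days=2000):
    """Keep events inside [index_date - window_days, index_date] for each patient."""
    kept = []
    for patient_id, event_id, event_date, value in events:
        index_date = index_dates[patient_id]
        if index_date - timedelta(days=window_days) <= event_date <= index_date:
            kept.append((patient_id, event_id, event_date, value))
    return kept


def aggregate_events(events, feature_map):
    """Count occurrences of each event per patient, keyed by integer feature index."""
    features = defaultdict(lambda: defaultdict(float))
    for patient_id, event_id, _, _ in events:
        features[patient_id][feature_map[event_id]] += 1.0
    return features


def to_svmlight(features, labels):
    """Render one '<label> <index>:<value> ...' line per patient, indices ascending."""
    lines = []
    for patient_id in sorted(features):
        pairs = " ".join(
            f"{idx}:{val:.6f}" for idx, val in sorted(features[patient_id].items())
        )
        lines.append(f"{labels[patient_id]} {pairs}")
    return lines


# Hypothetical toy data: (patient_id, event_id, event_date, value)
events = [
    (1, "DIAG_A", date(2020, 1, 10), 1.0),
    (1, "DIAG_A", date(2020, 1, 20), 1.0),
    (1, "LAB_B", date(2010, 1, 1), 1.0),  # falls outside the observation window
    (2, "LAB_B", date(2020, 2, 1), 1.0),
]
index_dates = {1: date(2020, 2, 1), 2: date(2020, 3, 1)}
feature_map = {"DIAG_A": 1, "LAB_B": 2}
labels = {1: 1, 2: 0}  # 1 = deceased, 0 = alive

filtered = filter_events(events, index_dates)
svmlight_lines = to_svmlight(aggregate_events(filtered, feature_map), labels)
```

Keeping the feature indices sorted within each line matters because SVMLight readers generally expect ascending indices.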
Assignment 2:
Solution and submission available on request
Objective and Quick Summary:
Objective
The objective of Assignment 2 is to analyze ICU clinical data to predict patient mortality within one month after discharge. The assignment involves implementing a logistic regression model and working with PySpark to process and analyze large healthcare datasets.
Quick Summary
Logistic Regression: Derive and implement gradient descent algorithms (both batch and stochastic) to train a logistic regression model. This includes deriving gradients, updating coefficients, and incorporating L2 regularization.
Descriptive Statistics: Compute various statistics such as event counts, encounter counts, and record lengths for deceased and alive patients using PySpark.
Data Transformation: The assignment requires transforming raw clinical event data into a standardized format suitable for machine learning. This involves filtering events based on observation and prediction windows, aggregating events, generating feature mappings, and applying normalization.
Feature Engineering: Work on creating feature vectors from the clinical data, mapping event codes to features, and ensuring data is normalized for effective machine learning model performance.
Implementation and Testing: The homework includes implementing these tasks in Python using the provided skeleton code, followed by running tests to ensure the correctness of the implemented functions.
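To make the logistic regression item concrete, below is a minimal pure-Python sketch of stochastic gradient updates with L2 regularization on sparse features. The learning rate `eta`, the regularization strength `mu`, the penalty form `mu * ||w||^2`, and the function names are assumptions for this example and are not taken from the assignment's skeleton code.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, x):
    """P(y = 1 | x) for a sparse feature dict x = {index: value}."""
    return sigmoid(sum(weights.get(i, 0.0) * v for i, v in x.items()))


def sgd_step(weights, x, y, eta=0.1, mu=0.01):
    """One stochastic ascent step on the log-likelihood minus mu * ||w||^2.

    Only weights of features present in x are updated, so the L2 shrinkage
    is applied lazily (a common shortcut for sparse data).
    """
    error = y - predict_proba(weights, x)
    for i, v in x.items():
        w = weights.get(i, 0.0)
        weights[i] = w + eta * (error * v - 2.0 * mu * w)
    return weights


def train(data, epochs=50, eta=0.1, mu=0.01, seed=0):
    """Run several shuffled passes of SGD over (features, label) pairs."""
    rng = random.Random(seed)
    data = list(data)
    weights = {}
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            sgd_step(weights, x, y, eta, mu)
    return weights


# Hypothetical toy data: feature 1 signals the positive class, feature 2 the negative.
toy_data = [({1: 1.0}, 1), ({2: 1.0}, 0)] * 5
weights = train(toy_data)
```

The batch variant would accumulate the same per-sample gradient over the whole dataset before applying a single update.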
Assignment 3:
Solution and submission available on request
Objective and Quick Summary:
Objective
The objective of assignment 3 is to implement both rule-based and unsupervised phenotyping algorithms to identify and analyze patient phenotypes, particularly focusing on Type 2 Diabetes, using Spark and PySpark.
Quick Summary
Rule-Based Phenotyping: Implement a phenotyping algorithm based on predefined rules from the Phenotype Knowledge Base (PheKB) for identifying Type 2 Diabetes cases and controls using patient encounter, diagnosis, medication, and lab results data.
Feature Construction: Develop features from raw healthcare data, such as the count of medications, diagnoses, and average lab test values, to use in clustering algorithms.
Unsupervised Phenotyping via Clustering: Perform clustering using K-Means and Gaussian Mixture Model (GMM) algorithms to discover groups of patients with similar characteristics, comparing these clusters with ground truth phenotypes.
Evaluation and Comparison: Assess the clustering results using purity metrics and compare the effectiveness of K-Means and GMM across different numbers of clusters.
Discussion and Analysis: Analyze and discuss the observed patterns from the clustering results, including how different clustering strategies impact the identification of patient phenotypes.
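The purity metric used to compare K-Means and GMM clusters against ground truth phenotypes can be sketched as follows. This is a generic implementation of the standard definition (sum of each cluster's majority-label count, divided by the number of points); the label names in the toy data are assumptions, and the assignment itself computed this with Spark rather than plain Python.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Share of points that carry the majority ground-truth label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster[cluster][label] += 1
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in by_cluster.values()
    )
    return majority_total / len(true_labels)


# Hypothetical assignments for six patients in two clusters.
clusters = [0, 0, 0, 1, 1, 1]
phenotypes = ["case", "case", "control", "control", "control", "unknown"]
score = purity(clusters, phenotypes)  # 4 of 6 points match their cluster's majority
```

Purity tends to rise as the number of clusters grows, which is worth keeping in mind when comparing results across different values of k.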
Assignment 4:
Solution and submission available on request
Objective and Quick Summary:
Objectives:
Implement Neural Networks:
Develop and experiment with various neural network architectures (MLP, CNN, RNN) on clinical data, specifically focusing on epileptic seizure classification and mortality prediction.
Data Handling:
Preprocess raw clinical data for use in neural network models, including tasks such as data loading, transformation, and feature extraction.
Model Evaluation and Improvement:
Train models and evaluate their performance using metrics like accuracy and confusion matrices. Explore methods to enhance model performance through architectural adjustments and other techniques.
Practical Application:
Apply deep learning techniques to real-world healthcare datasets, demonstrating the practical utility of these models in healthcare analytics.
Summary:
This assignment involves developing and refining neural network models to address two primary tasks:
Epileptic Seizure Classification:
Implement a Multi-layer Perceptron (MLP), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN) using EEG data.
Focus on data loading, model building, and improving model performance. Evaluate models using learning curves and confusion matrices.
Mortality Prediction with RNN:
Use longitudinal electronic health record (EHR) data to predict patient mortality.
Involves preprocessing data, training a Recurrent Neural Network, and evaluating the model’s performance.
Deliverables include the submission of trained models, Python code, and a detailed report outlining the methods used, results obtained, and any improvements made to the models. The assignment emphasizes both the technical implementation of neural networks and their application to healthcare data.
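For the evaluation step, a confusion matrix and the accuracy derived from it can be computed as in the sketch below. The three-class labels and predictions are made-up values for illustration, not results from the assignment, and the helper names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples with true class t that were predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for six samples across three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, num_classes=3)
acc = accuracy(matrix)
```

Reading the off-diagonal cells row by row shows which classes a model confuses, which accuracy alone hides.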
Group Project: BDH Reproducibility Challenge
Challenge
Objectives and Quick Summary of challenge:
Objectives:
Reproduce Published Work:
Select a recent paper from a provided pool and attempt to replicate its main experiments, focusing on machine learning or deep learning in healthcare.
Assess the reproducibility of the experiments and the validity of the conclusions drawn in the paper.
Understand and Evaluate:
Gain a deep understanding of the selected paper, including its data, algorithms, and methodologies.
Evaluate how easily the paper's results can be reproduced and identify challenges or gaps in the original work.
Contribute to Reproducibility:
Provide a comprehensive report on the reproducibility process, detailing what was successful, what challenges were faced, and any deviations from the original study.
Offer suggestions to improve reproducibility in future research.
Summary:
The BDH Reproducibility Challenge is a group project within the CSE 6250 course, where students work in pairs to replicate and evaluate recently published work in machine learning and deep learning applied to healthcare. The project involves several stages:
Paper Selection: Teams select 2-3 candidate papers from a provided pool and ultimately choose one paper to replicate. The paper should be feasible for reproduction based on data accessibility and computational requirements.
Project Proposal: Teams submit a proposal summarizing their selected paper, explaining its importance, the data used, and the hypotheses they plan to test.
Project Draft: Teams develop initial code and run preliminary experiments, submitting a notebook on Google Colab that demonstrates their progress.
Final Report: The final submission includes a detailed report, code, and a presentation, assessing the reproducibility of the selected paper and discussing the outcomes.
The project aims to highlight the challenges of reproducibility in research and encourage best practices in conducting and reporting scientific experiments.
Paper Selection and Project Proposal:
Summary of paper selection and proposal:
Paper Reviews:
Three papers were reviewed as potential candidates for reproduction:
Paper 1: AI-Driven Clinical Decision Support: Enhancing Disease Diagnosis by Exploiting Patient Similarity.
Focus: Predicting health conditions based on patient similarity.
Challenges: Convoluted code, limited performance metrics, and insufficient details on experimental setup.
Paper 2: Real-world Patient Trajectory Prediction from Clinical Notes Using Artificial Neural Networks and UMLS-Based Extraction of Concepts.
Focus: Predicting clinical trajectories (diagnosis, mortality, readmission) using UMLS-based concept extraction from clinical notes.
Advantages: Readily available data and well-documented code.
Challenges: Time-intensive preprocessing.
Paper 3: Comparing Deep Learning and Concept Extraction Methods for Patient Phenotyping from Clinical Narratives.
Focus: Phenotyping patients based on clinical narratives using deep learning.
Challenges: Missing pre-trained embeddings, broken code links, and difficulty in annotation linking.
Selected Paper:
Paper 2: "Real-world Patient Trajectory Prediction from Clinical Notes Using Artificial Neural Networks and UMLS-Based Extraction of Concepts" was chosen for reproduction due to its clear methodology and accessible data.
Reproduction Plan:
Hypothesis: The correct selection of UMLS concepts based on similarity and TUI thresholds can improve patient trajectory prediction.
Scope: Focus on foundational steps, such as preprocessing notes for UMLS CUI extraction, and conducting key experiments like training a Feed Forward Network for diagnosis code prediction.
Challenges: Computational intensity and potential long training times. Mitigation involves using GPU resources and, if necessary, applying statistical methods for comparison.
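One way to read the hypothesis above is that extracted UMLS concepts are kept only when their match similarity clears a threshold and their semantic type (TUI) is in an allowed set. The sketch below illustrates that reading only; the record fields (`cui`, `tui`, `similarity`), the placeholder CUIs, the 0.7 threshold, and the allowed TUI set are all assumptions and do not come from the paper's code.

```python
def filter_concepts(concepts, min_similarity=0.7, allowed_tuis=frozenset({"T047", "T184"})):
    """Keep CUIs whose match score and semantic type pass both filters.

    T047 (Disease or Syndrome) and T184 (Sign or Symptom) are UMLS semantic
    types; choosing them here is purely illustrative.
    """
    return [
        c["cui"]
        for c in concepts
        if c["similarity"] >= min_similarity and c["tui"] in allowed_tuis
    ]


# Hypothetical extractor output for one clinical note (placeholder CUIs).
extracted = [
    {"cui": "C0000001", "tui": "T047", "similarity": 0.92},
    {"cui": "C0000002", "tui": "T047", "similarity": 0.55},  # score too low
    {"cui": "C0000003", "tui": "T121", "similarity": 0.97},  # TUI not allowed
    {"cui": "C0000004", "tui": "T184", "similarity": 0.70},
]
kept_cuis = filter_concepts(extracted)
```

Sweeping `min_similarity` and the allowed TUI set, then retraining the downstream model, is the kind of experiment the hypothesis calls for.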
Risks and Mitigation:
The primary risk is the computational intensity of the experiments, particularly the need for prolonged GPU usage. As a mitigation strategy, the team plans to use Google Colab for GPU access and consider alternative statistical methods if computational demands are too high.
Link to original paper
Submitted demo colab notebook and github codebase available on request
Final Demo Colab Notebook
Final report submitted
Summary of report:
The report focuses on reproducing the results of a research paper titled "Real-world Patient Trajectory Prediction from Clinical Notes Using Artificial Neural Networks and UMLS-Based Extraction of Concepts."
Key Sections:
Group Information:
Team ID: B4
Members: Amber Gupta and Ayush Parikh
The project reproduces a paper published in the NLM-PMC in June 2021 by Jamil Zaghir and colleagues.
Reproduction Implementation:
Demo Notebook: A Google Colab notebook was created and shared with TAs to reproduce the results.
Presentation Deck: The presentation material used during the project is hosted online.
Presentation Video: A video summarizing the project is also available.
Codebase: The project's code is hosted on Georgia Tech's GitHub.
README: A README file is provided to guide the TAs through reproducing the results.
Additional Observations:
The demo notebook includes comprehensive coverage of the comments, figures, and equations relevant to the project.
The report is a collection of references to the project materials (code, presentation, and video) that the TAs need to evaluate the reproducibility of the original paper's results.
Final presentation
Presentation highlights:
Introduction & Motivation: The project aims to predict clinical trajectories using admission diagnosis data, addressing the challenges of processing unstructured text data. The motivation lies in the application to preventive medicine and clinical recommendations.
Problem Setup: The selected paper uses UMLS concepts and neural networks to predict patient trajectories. The innovation includes filtering and representing clinical text for better generalization and performance.
Methodology: Data is sourced from MIMIC III and processed for UMLS extraction. Models, including feed-forward networks (FFN) and RNNs (GRU), are trained using this data.
Results: The results show that tuning the model's parameters, such as the similarity threshold for CUIs, leads to improved performance in disease prediction. FFN models generally outperform RNN models in this setup.
Reproducibility Assessment: The study reproduced most of the paper’s results but faced challenges due to processing demands. The reproducibility is deemed possible but requires extensive preprocessing.
Conclusion & Future Work: The presentation concludes that the paper is reproducible, but that better results would require further optimization and experiments with additional datasets and model configurations.