The World Economic Forum estimates that hospitals generate a staggering 50 petabytes of data per year. That data has a ton of potential, but the sheer amount has a paralyzing effect. It’s no surprise that 97% of it sits untouched.
Consultants often promote AI as a blanket solution for bad data, but the reality is simple: bad data means bad AI, no exceptions. And in healthcare, that doesn’t just mean revenue shortfalls. It can mean serious consequences for patient well-being.
Bad data stems from data silos, poor data quality, privacy concerns, and technical barriers. Fixing it starts with a process for cleaning it up and a platform like Snowflake that keeps it clean, secure, and accessible.
With so much on the line, it’s essential to get the fundamentals right. We’ll break down the sources of bad data, the signs of good data, and a Snowflake roadmap for good AI in healthcare.
The Problems
Data Silos
A significant part of the AI challenge in healthcare arises from the fact that most hospitals have multiple platforms that don’t communicate with each other. The average hospital’s typical tech stack includes:
- EHRs that serve as the primary source of patient information
- Credentialing systems that verify physician qualifications, history, and competency
- Lab information systems that store test results
- Radiology information systems that store medical images, such as X-rays and MRIs
- Pharmacy Systems that store medication history
- Billing and claims systems that store financial information
- Internet of Things and medical devices that store personal healthcare data
- Public health data sets from government agencies
Each of these systems contains structured, semi-structured, or unstructured data, and they often use different data formats, coding standards, or terminologies. That means data exchanged between systems is often misunderstood, incomplete, or requires substantial manual effort to standardize.
Each system also contains data for multiple constituents, including patients, employees, and physicians. These categories aren’t mutually exclusive. For instance, employees and physicians are also patients. These overlaps blur the lines between data types, making AI readiness even more challenging, and a solid foundational data strategy even more critical.
Data Quality
Data quality issues stem from processes. Manual data entry leads to typos, ambiguous abbreviations, and other inconsistencies that AI will misinterpret. Data from lab or radiology systems is also notoriously difficult for AI to process because much of it is unstructured. Without natural language processing (NLP) techniques, that data could go underutilized or be misinterpreted.
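As a rough illustration of what NLP-style extraction involves, the sketch below pulls numeric results out of a hypothetical free-text lab note using regular expressions. The note format and test names are invented; real clinical text varies far more and typically requires a dedicated NLP pipeline.

```python
import re

def extract_lab_values(note: str) -> dict:
    """Pull numeric results out of a free-text lab note.

    The note format here is hypothetical; real lab narratives vary
    widely and usually need a full NLP pipeline, not a regex.
    """
    # Match patterns like "Hemoglobin: 13.5" or "WBC 7.2"
    pattern = re.compile(r"(?P<test>[A-Za-z][A-Za-z ]+?)[:\s]+(?P<value>\d+(?:\.\d+)?)")
    results = {}
    for match in pattern.finditer(note):
        results[match.group("test").strip().lower()] = float(match.group("value"))
    return results

note = "Hemoglobin: 13.5 g/dL. WBC 7.2. Platelets: 250"
print(extract_lab_values(note))
# → {'hemoglobin': 13.5, 'wbc': 7.2, 'platelets': 250.0}
```

Even this toy version shows why structure matters: once values are keyed and numeric, they can be validated, trended, and fed to a model instead of sitting in a text blob.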
Data gaps are also present across all systems. Patient records in some systems may be missing important details that are available in others. For instance, the EHR may lack crucial prescription information from the pharmacy platform, omitting details that could aid physicians in delivering better care.
Regulatory Concerns
Healthcare is one of the most regulated industries. These regulations add a layer of complexity for new technologies like AI. For instance, HIPAA imposes strict rules on how protected health information is stored and shared. Additionally, the question of who is ultimately responsible for AI-generated diagnoses means that hospitals require a framework to make sure that AI is using data securely and ethically. Beyond that, AI needs to be compliant with federal and state laws.
Technical Barriers
Even with perfectly integrated systems, high-quality data, and well-regulated frameworks, technical barriers can still hinder AI implementations. Some of this comes down to staffing issues. Many hospitals lack the data scientists, machine learning experts, or analysts needed to build and deploy AI models. The cost of onboarding, training, and maintaining a team of experts can quickly balloon out of control. Beyond staff, healthcare is notoriously resistant to change.
The Signs of Good Data
Now that we’ve addressed the problems that lead to bad data, let’s talk about good data. More specifically, how can you tell when your data is good? Good data can feel like it’s tough to define, but there are characteristics you can look for. Good data is typically accessible, diverse, governed, and accurate.
Let’s go through each characteristic.
Accessible
Accessibility means data gets to the people who need it quickly and easily. It’s not just about having access; it’s about how effortless it is to find. That’s where context and metadata come in, providing the clarity and structure that make locating data easier. Stakeholders can find the information they’re looking for in multiple ways, by type, date, and other categorizations. Accessible data means better clinical decision making, improved care coordination, and more efficient processes.
Diverse
At the individual level, data diversity provides a 360-degree profile of each patient, including medical, demographic, genetic, behavioral, and environmental factors. Viewing all these factors in one place is crucial to creating a complete picture of the patient to deliver effective care.
Governed
Governance means that processes, policies, and roles control the entire lifecycle of a piece of data. It defines who can take what action, with what data, and under what conditions. Proper governance keeps your data reliable, reduces risk, and builds trust over time.
Accurate
Accurate data is the backbone of patient care. It’s more than just getting the diagnoses, treatments, and records correct once. It’s about keeping them up to date on an ongoing basis. After all, timeliness is everything in healthcare. Data can change quickly, and to stay accurate, updates must happen immediately. Every system, every record, and every touchpoint must reflect those changes in real time.
Best Practices for AI Data Readiness with Snowflake
Now that we’ve addressed the data problems that lead to bad AI and the characteristics of good data, let’s take a look at how to start your data readiness journey with Snowflake.
Here are seven best practices to keep in mind.
1. Assess and Clean Data
- **Accuracy**: Ensure all data paints an accurate patient profile.
- **Diversity**: Fill in missing information and remove outdated records.
- **Timeliness**: Ingest data as close to real time as possible to keep AI models current.
- **Standardization**: Use consistent formats, codes, and naming conventions across all sources.
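A cleaning pass can start with simple automated audits. The sketch below checks a hypothetical patient record for missing required fields and staleness; the field names and one-year threshold are assumptions to adapt per source system.

```python
from datetime import date

# Hypothetical required fields and staleness threshold; adjust per source system.
REQUIRED_FIELDS = {"patient_id", "dob", "last_updated"}
MAX_AGE_DAYS = 365

def audit_record(record: dict, today: date) -> list:
    """Return a list of data-quality issues found in one patient record."""
    issues = []
    # Flag any required field that is absent from the record.
    for field in sorted(REQUIRED_FIELDS - record.keys()):
        issues.append(f"missing field: {field}")
    # Flag records that haven't been touched within the freshness window.
    last_updated = record.get("last_updated")
    if last_updated and (today - last_updated).days > MAX_AGE_DAYS:
        issues.append("stale record")
    return issues

record = {"patient_id": "P001", "last_updated": date(2023, 1, 1)}
print(audit_record(record, today=date(2025, 1, 1)))
# → ['missing field: dob', 'stale record']
```

Running a check like this over every source before ingestion turns "assess and clean" from a one-time project into a repeatable gate.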
2. Integrate Data Sources
- **Centralize Data**: Use a cloud data platform like Snowflake to ingest structured, semi-structured, and unstructured data from EHRs, lab systems, claims, marketing, and wearables.
- **Interoperability**: Map and harmonize fields across systems using integration tools like Fivetran or MuleSoft.
- **Identity Resolution**: Use entity resolution to create unified patient, employee, and provider profiles.
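To make entity resolution concrete, here is a minimal sketch assuming name and date of birth are the matching attributes: date of birth acts as a hard check, and names are fuzzy-matched so formatting differences don't block a link. Production systems weigh many more attributes (address, phone, hashed identifiers) and use dedicated matching engines.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation so formatting differences don't block a match."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()

def same_person(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Decide whether two records describe the same person.

    The fields and threshold are illustrative; real entity resolution
    weighs many attributes, not just name and date of birth.
    """
    if a["dob"] != b["dob"]:  # hard check: DOB must agree exactly
        return False
    score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return score >= threshold

ehr = {"name": "Jon Smith", "dob": "1980-04-02"}
claims = {"name": "John Smith", "dob": "1980-04-02"}
print(same_person(ehr, claims))  # → True
```

The design choice worth noting: exact checks on stable fields plus fuzzy checks on noisy fields is the basic shape of almost every matching rule, however sophisticated the implementation.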
3. Establish Strong Data Governance
- **Governance Framework**: Create policies for data entry, lifecycle, retention, and removal.
- **Access Controls**: Implement role-based permissions, audit trails, and field-level security for PHI.
- **Compliance**: Meet HIPAA, SOC 2, and ISO 27001 standards using Snowflake’s native security and encryption.
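To illustrate role-based, field-level access control, the sketch below masks fields a role isn't entitled to see. The roles and policy table are hypothetical; in Snowflake itself this is typically enforced with native dynamic data masking and row access policies rather than application code.

```python
# Hypothetical role-to-field policy table for illustration only.
POLICIES = {
    "clinician": {"patient_id", "name", "diagnosis", "medications"},
    "billing":   {"patient_id", "name", "insurance_id"},
    "analyst":   {"patient_id", "diagnosis"},  # de-identified analytics view
}

def apply_policy(record: dict, role: str) -> dict:
    """Return a copy of the record with unauthorized fields masked."""
    allowed = POLICIES.get(role, set())  # unknown roles see nothing
    return {k: (v if k in allowed else "***MASKED***") for k, v in record.items()}

record = {"patient_id": "P001", "name": "Jane Doe",
          "diagnosis": "J45.909", "insurance_id": "INS-42"}
print(apply_policy(record, "analyst"))
# → {'patient_id': 'P001', 'name': '***MASKED***', 'diagnosis': 'J45.909', 'insurance_id': '***MASKED***'}
```

The point of centralizing this as a policy table rather than scattering `if role == ...` checks through code is auditability: one place defines who sees what.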
4. Monitor Data Quality
- **Continuous Monitoring**: Set up automated checks for data errors and duplicates.
- **Feedback Loops**: Review AI outputs to catch data issues before they become systemic problems.
- **User Training**: Educate staff on healthy data entry and validation.
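An automated duplicate check can be as simple as counting records that share a key. The sketch below uses a hypothetical (name, date of birth) key; a scheduled version of a check like this would alert staff when duplicate counts drift upward.

```python
from collections import Counter

def find_duplicates(records: list) -> list:
    """Flag (name, dob) keys that appear more than once.

    A simple duplicate heuristic for illustration; the key fields are
    an assumption and production checks would run on a schedule.
    """
    keys = Counter((r["name"].lower(), r["dob"]) for r in records)
    return [key for key, count in keys.items() if count > 1]

records = [
    {"name": "Jane Doe", "dob": "1990-05-01"},
    {"name": "JANE DOE", "dob": "1990-05-01"},  # same patient, different casing
    {"name": "Bob Ray", "dob": "1985-02-17"},
]
print(find_duplicates(records))
# → [('jane doe', '1990-05-01')]
```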
5. Eliminate Bias
- **Bias Detection**: Analyze data for representation gaps or historical biases before model training.
- **De-identification**: Remove or redact sensitive information as needed for model training and compliance.
- **Auditability**: Maintain logs of data transformations and model decisions for transparency.
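A de-identification pass often starts with pattern-based redaction. The sketch below replaces two identifier formats with labeled placeholders; it is illustrative only, since HIPAA Safe Harbor de-identification covers eighteen identifier types, not two.

```python
import re

# Illustrative patterns only — real de-identification must cover all 18
# HIPAA Safe Harbor identifier types, not just these two formats.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace recognized identifiers with labeled placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("SSN 123-45-6789, call 555-867-5309"))
# → SSN [SSN], call [PHONE]
```

Keeping a labeled placeholder rather than deleting the value outright preserves the record's shape, which helps downstream audits confirm what was removed and why.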
6. Iterative Implementation and Change Management
- **Start Small**: Pilot with high-quality, high-impact datasets and expand as data maturity improves.
- **Stakeholder Involvement**: Engage IT, clinical, and business teams in data readiness and AI use case definition.
- **Continuous Improvement**: Use feedback from pilots to refine data pipelines and governance.
7. Leverage Snowflake’s AI and Analytics Ecosystem
- **Native AI Tools**: Use Snowflake’s Snowpark, Streamlit, and partner integrations for scalable analytics and AI.
- **Data Sharing**: Enable secure, governed data sharing with partners for collaborative AI initiatives.
- **Automation**: Automate routine data prep and transformation tasks to reduce manual errors.
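One routine prep task worth automating is date standardization, since source systems rarely agree on formats. The sketch below converts a few assumed source formats to ISO 8601; the format list is hypothetical and would grow as new feeds are onboarded.

```python
from datetime import datetime

# Hypothetical source formats; extend this list as new feeds are onboarded.
KNOWN_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y"]

def standardize_date(raw: str) -> str:
    """Convert a date string in any known source format to ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date format: {raw!r}")

print([standardize_date(d) for d in ["04/02/1980", "1980-04-02", "02-Apr-1980"]])
# → ['1980-04-02', '1980-04-02', '1980-04-02']
```

Raising on unrecognized input, rather than guessing, is deliberate: a failed transformation should surface as a data-quality alert, not a silently wrong date.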
A Tested Roadmap for AI Data-Readiness with Snowflake
Here’s a sample roadmap we use with our healthcare customers to help them get ready for AI. We typically recommend Snowflake for these use cases because it excels at breaking down data silos, maintaining data quality, addressing privacy concerns, and removing technical barriers. It might seem like a front-loaded process, but a solid data foundation is crucial to AI’s success.
1. **Discovery & Assessment**: Inventory data sources, map data flows, and identify gaps.
2. **Data Cleansing & Standardization**: Clean, deduplicate, and align data formats.
3. **Integration**: Centralize data in Snowflake, harmonize schemas, and resolve identities.
4. **Governance & Security**: Set up access controls, compliance checks, and monitoring.
5. **AI Enablement**: Prepare data for model training, implement bias checks, and validate outputs.
6. **Iterate & Scale**: Pilot use cases, gather feedback, and expand to broader datasets and departments.
Sure, the prospect of AI in healthcare is exciting. But the quality of your data dictates the quality of your AI. By following the best practices above, you’ll be well on your way to a successful, safe AI deployment.
Need Snowflake Help?
We make your data ready for AI with Snowflake
Penrod is your go-to healthcare implementation partner, bringing Snowflake to life by turning your data into action.
Learn More →