Tuva Health and the future of claims analytics
How open-source tools, community, and dbt can help centralize healthcare data knowledge and best practices
Ah, medical claims. What are they, exactly? It’s complicated.
Know this: to get paid, healthcare providers and facilities have to submit claims to insurance companies. Insurance companies are also known, fittingly, as payers.
Claims are itemized bills, with detailed information on the procedures performed, diagnoses recorded, healthcare costs, insurance, and much more.
Thanks to the level of detail they contain, large claims datasets can be an interesting resource for lots of stakeholders across the healthcare system, from researchers to digital health startups.
Analyzing claims data is hard. Datasets are notoriously noisy, messy, incomplete, inconsistently structured, and hard to understand. Done right, however, claims analytics can help answer questions like these:
Spend and utilization: what are we spending and what are people using?
Risk adjustment: how does a person’s health compare to “average health” in a population?
Quality: are there identifiable patterns that yield insights into individual provider quality?
The answers to these questions can help uncover variability and opportunity—focus areas and levers for healthcare stakeholders. These levers help sharpen healthcare interventions and enable better, lower-cost healthcare for more people.
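To make the first of those questions concrete: with clean claims data in hand, a basic spend metric like per-member-per-month (PMPM) spend is a short SQL query away. Here’s a minimal sketch, assuming a hypothetical medical_claims table with member_id, claim_start_date, and paid_amount columns:

```sql
-- Minimal PMPM spend sketch; table and column names are hypothetical.
select
    date_trunc('month', claim_start_date) as claim_month,
    count(distinct member_id) as members_with_claims,
    sum(paid_amount) * 1.0 / count(distinct member_id) as pmpm_spend
from medical_claims
group by 1
order by 1
```

A real PMPM calculation would divide by eligible member-months from an enrollment table rather than by members who happened to have claims, but the shape of the query is the same. The hard part, as we’re about to see, is getting to clean data at all.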
So how do we get from point A (messy, historical claims data) to point C (shiny, sparkly insights)? By way of point B: a clean dataset.
Getting to point B is the hardest part. The data processing, cleaning, and normalization journey between points A and B never looks quite the same: its shape depends largely on the question and the dataset at hand.
In a foray into provider quality, for instance, you might need to translate pieces of claims data into information about good and bad health outcomes. I’ve found that the right answers to complex questions like these often lie in the hands of people who have done this work for decades.
Enter Tuva Health, a company that’s centralizing this essential knowledge by building community around open-source tools to enable claims (and other healthcare data) analytics.
As others have written, including folks from Tuva, the Tuva Project packages together a few key elements:
Data validity tests: think of these as your GPS on the way from messy data point A to clean data point B. Are there duplicate claim lines in your dataset? Are there invalid bill type, revenue, or place of service codes bouncing around in there? Tuva’s data tests will help you think about your data, its issues, and how to fix them (there’s a sketch of one such test after this list).
A common data model: raw data sources are often structured differently depending on the dataset. A common data model establishes a standard data format as a jumping-off point for all downstream analytics.
Terminology: frequently, health data scientists and data engineers have to search for or create custom datasets to augment analyses (e.g. pulling down a list of valid ICD-10 diagnosis codes). With Tuva’s terminology sets, everyone working on the Tuva Project operates from the same source of truth.
Concepts and data marts: concepts refer to standard rules and methodologies that are central to analytics, like hospital readmissions and chronic conditions. Data marts are the code that processes the necessary upstream data in accordance with those concepts.
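To give a flavor of the first element, here’s what a data validity test can look like in practice: a minimal dbt-style singular test that flags duplicate claim lines. The model and column names are hypothetical, and in dbt the test fails if the query returns any rows:

```sql
-- tests/duplicate_claim_lines.sql
-- dbt singular test: any returned rows count as failures.
-- Model and column names are hypothetical.
select
    claim_id,
    claim_line_number,
    count(*) as occurrences
from {{ ref('medical_claims') }}
group by claim_id, claim_line_number
having count(*) > 1
```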
There are plenty of open-source healthcare data frameworks out there. So many, in fact, that I could spend years writing a book about them all (here’s a good, long list). But the Tuva Project is an open-source effort focused specifically on healthcare data analytics, and only a few other open-source projects share that focus.
The one most similar to the Tuva Project is called the OHDSI initiative (pronounced “odyssey,” and short for Observational Health Data Sciences and Informatics). Launched in 2014, OHDSI is developed and maintained by a community of researchers all over the world. OHDSI is home to vocabulary datasets, a common data model, and concepts built on top of that data model.
OHDSI covers similar ground to the Tuva Project, and it’s been around much longer. So what is Tuva doing better, or differently?
The Tuva Project is a much better fit for the modern data stack.
Let’s back up: large-scale data analyses often depend on data pipelines, which are typically structured according to ETL (extract, transform, load) or ELT (extract, load, transform) frameworks.
OHDSI’s software is built in multiple programming languages, namely Java and R, and best supports an ETL framework. The Tuva Project’s software is built entirely in dbt, a tool that enables rapid data transformation and fits best in an ELT framework.
As modern databases have become more powerful, it’s also gotten easier to load data and then transform it in-warehouse (ELT) rather than transform data out-of-warehouse before loading it (ETL). So ELT pipelines have become more common, and dbt handles the T (transform) piece of an ELT data pipeline.
If you think of data as sand, and a completed data analysis as a sandcastle, dbt projects would be your buckets and shovels. It’s possible to build a sandcastle with your hands, but it would take a lot more time. Your sandcastle would probably look better if you just used buckets and shovels.
Tuva is focused on building really good buckets and shovels for people building data sandcastles in healthcare. Tuva’s data marts are structured as dbt projects, made up of .sql and .yml files. Another perk of using dbt: because most of the code powering the transformations is SQL, these projects aren’t too far from plain English, and they’re more accessible to people without data science backgrounds than, say, R or Java code.
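To ground the analogy, here’s a sketch of what one of those .sql files can look like: a dbt-style staging model that handles the T of ELT inside the warehouse. The source and column names are hypothetical, not Tuva’s actual model:

```sql
-- models/staging/stg_medical_claims.sql
-- Hypothetical staging model: the raw table has already been extracted
-- and loaded; dbt runs this transformation in-warehouse.
select
    claim_id,
    claim_line_number,
    upper(trim(bill_type_code)) as bill_type_code,
    cast(claim_start_date as date) as claim_start_date,
    cast(paid_amount as numeric(18, 2)) as paid_amount
from {{ source('raw', 'medical_claims') }}
where claim_id is not null
```

Even if you’ve never written SQL, you can more or less read what this model does, which is exactly the accessibility point above.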
As a brief aside, another open-source healthcare data analytics project worth mentioning is OpenSAFELY, which focuses on EHR data in the UK. Very cool, but not (yet) directly comparable to what Tuva is doing with healthcare data in the U.S.
Recently, Tuva released a claims preprocessing engine, which maps raw claims data into encounters (hello, point B!). Even more recently, they announced the release of five free synthetic datasets in partnership with Syntegra.
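Mapping claims to encounters is one of those deceptively hard problems. As a deliberately oversimplified sketch of the idea (not Tuva’s actual logic), grouping institutional claim lines by member and admission date might look like this, again with hypothetical table and column names:

```sql
-- Toy claims-to-encounters grouping; not Tuva's engine.
-- Real logic must merge overlapping and adjacent date ranges,
-- handle transfers, distinguish encounter types, and much more.
select
    member_id,
    claim_start_date as encounter_start_date,
    max(claim_end_date) as encounter_end_date,
    sum(paid_amount) as encounter_paid_amount
from medical_claims
where claim_type = 'institutional'
group by member_id, claim_start_date
```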
As the push for health data interoperability continues, now is the time to experiment with Tuva’s open-source tools. With greater access to healthcare data will come broader adoption and more complex applications. The Tuva Project presents an opportunity for companies to operate from a set of fundamental, agreed-upon, best-practice data concepts and building blocks: dbt, version control, and core concepts help establish good data hygiene and infrastructure from the start.
Reservations about creating and retaining intellectual property can prevent companies from adopting open-source tools. But beyond Tuva’s building blocks, there’s still ample room for customization and subsequent innovation: companies need only fork a repository and customize away.
Good data infrastructure, solid data concepts, and powerful, customizable code are a recipe for robust analytics and compelling insights. As Tuva Health and members of its community continue aggregating knowledge and best practices that have been disparate for decades, healthcare data and analytics will be better for it.
Many thanks to Brendan Keeler, Rachel Menon, Joe Mercado, Rik Renard, and Jan-Felix Schneider for their edits and suggestions. If you read this article and want to chat, I’d love to hear from you! You can find me at @krishmaypole on Twitter. I’d also love to connect on LinkedIn.