Capitol Tech sat down with Pam Phojanakong, an epidemiologist with a PhD from Drexel University, to discuss the challenges facing data scientists in the midst of a global pandemic. Pam currently works as a data scientist for CORMAC, a Baltimore-based data management and analytics firm, supporting a contract that works with data submitted by post-acute care providers. Pam was instrumental in developing a COVID dashboard, which looks at COVID rates compared to Census data for key occupations and social characteristics.
What are the biggest challenges surrounding data and COVID?
Analysis is an iterative process. With all hands on deck for the pandemic, what we know is changing from day to day and week to week, which makes iterative processes difficult to sustain. Until now, I have had the luxury of a good handle on the trajectory of the outcomes I’m working on – what “normal” changes look like – but with COVID, there is no normal.
It’s nice that people are mobilizing resources and forming collaborations to share analyses, but it’s hard to make sense of the noise because there is so much data coming out at once. Many times, articles and information are shared, but in the end it all comes back to the same data source.
Because it’s such an urgent crisis, everyone is making decisions in an emergency state which isn’t the best framework for analytic thinking.
Data for nursing homes is a great example. The Centers for Medicare & Medicaid Services (CMS), which oversees the nation’s 15,000+ nursing homes, wanted to make nursing home COVID data as readily available as possible. To do that, they let nursing homes roll up their COVID cases through May 24, instead of having to go back and retrospectively report separate counts. That is obviously less labor- and resource-intensive, and it gives us a starting point quickly, but it doesn’t provide the specific “when.” Say a home reported 45 cases. When did those 45 cases happen? It’s important to look beyond the fact that there are 45 cases as of May 24.
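The point about rolled-up counts can be made concrete with a small sketch. The homes, dates, and case numbers below are entirely hypothetical – the sketch only illustrates why two providers reporting the same cumulative total as of a cutoff date can be in very different situations:

```python
from datetime import date

# Two hypothetical nursing homes, each rolling up the same total of
# 45 cases as of May 24 -- but with very different timing underneath.
home_a = {date(2020, 3, 20): 40, date(2020, 5, 20): 5}   # early outbreak, now quiet
home_b = {date(2020, 5, 18): 15, date(2020, 5, 22): 30}  # outbreak still growing

def rolled_up_total(daily_counts):
    """The single number a rolled-up report provides: all timing is lost."""
    return sum(daily_counts.values())

def recent_cases(daily_counts, since):
    """What the rolled-up number cannot tell us: how much is recent activity."""
    return sum(n for d, n in daily_counts.items() if d >= since)

cutoff = date(2020, 5, 15)
print(rolled_up_total(home_a), rolled_up_total(home_b))            # 45 45
print(recent_cases(home_a, cutoff), recent_cases(home_b, cutoff))  # 5 45
```

Both homes report “45 cases as of May 24,” but only the dated counts reveal that one outbreak is over while the other is still accelerating – exactly the distinction the rolled-up reporting erases.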
What do you think data analytics students can learn from COVID?
Everything is moving at a pace much faster than in a normal time. You are living through a simulation study – but it’s real life.
The more you learn, the more you learn what you don’t know. In the general sense, as you get more experience in analytics, you learn about the shortcomings of the data you’re working with and, in hindsight, that there is almost always a better approach than the one you landed on. With COVID, hindsight comes much faster. COVID has exposed our vulnerabilities at every stage – our ability to get data depends on small, local, territorial offices. It’s not a top-down effort, but we make decisions in a top-down manner. We think we have a great IT infrastructure, but then we discover that not every health provider has equal infrastructure to support reporting, or resources to devote to it.
I think one of the biggest lessons is that there’s not going to be a single thing that’s untouched by this. From the epidemiology standpoint, we talk about causality, necessary vs. sufficient causes, confounders, mediators, etc. For a lot of analytic work, there’s an implied search for a root cause or constellation of factors, to distill complex processes into a neat table of results and be able to say “these variables are driving your outcome, and there were some things that looked like they were related, but they’re really not.” Everything is related now. Having come from academic research studies to working with CMS, you find everything is interdependent. You think that if you have a “clean” data set and your model holds, then you’re fine, which isn’t necessarily true.
People are putting a lot of stock in the quality of data but not questioning what is coming in, what gets missed, or what the assumptions are. This is probably the only time a student gets to live through a teachable moment when we’re all learning together, watching assumptions being made and really experiencing the consequences of those assumptions—right or wrong.
To use the example of machine learning – I am learning as it is learning. We are learning as the data evolves. Normally, you think of it the other way around. We train the algorithm. Right now, I think we are being trained and informed by our models.
How do you determine which data is the best to use?
It’s important to understand the limits of the data you have, what you can and cannot do. For the CORMAC dashboard, the intention was to combine social determinants of health factors with the geographic distribution of cases in the DMV area, but once we realized we were not going to get patient-based data on individual cases, we pivoted to a place-based approach to the issue: what area social characteristics are associated with county burden of disease? The county level was the smallest area I was willing to work with that I trusted; we had access to reliable information and we were also on solid theoretical ground — where you live impacts your health. It’s a similar mindset with using the nursing home data to try to build a predictive model. It’s not at the patient-level, which would be the most informative if you’re thinking about cases, but it is at the nursing home level, which may help CMS understand which providers are going to be vulnerable to an outbreak and shed light on measures and protocols that are especially important in a crisis. Working through assumptions and coming up with solid reasoning is key.
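The place-based pivot described above can be sketched in a few lines. Every county name, case count, population, and Census characteristic below is invented for illustration – this is not CORMAC’s data or dashboard logic, just the shape of the analysis: compute a county-level burden of disease and pair it with an area social characteristic:

```python
# Hypothetical county-level inputs (all names and numbers are illustrative):
# cumulative cases, population, and one Census-style social characteristic,
# here imagined as the percent of workers in key occupations.
cases = {"County A": 1200, "County B": 300, "County C": 900}
population = {"County A": 100_000, "County B": 50_000, "County C": 60_000}
pct_key_occupations = {"County A": 35.0, "County B": 20.0, "County C": 40.0}

def cases_per_10k(county):
    """County burden of disease: cumulative cases per 10,000 residents."""
    return cases[county] / population[county] * 10_000

# A place-based view: pair each county's burden with its social characteristic,
# rather than trying to model individual patients we have no data on.
for county in sorted(cases):
    print(county, round(cases_per_10k(county), 1), pct_key_occupations[county])
```

The design choice mirrors the interview: since patient-level data wasn’t available, the unit of analysis drops to the smallest geography with reliable data (the county), and the question becomes which area characteristics track with county burden.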
There is also a soft skill that can be hard to teach and grasp – understanding that there’s more happening than what’s on the page. This is a great working case of that. Everyone is watching dashboards and the numbers on the news, but not thinking through what’s driving the numbers. If you’re really curious and want to be the best analyst you can be, you have to know what you have and what you don’t have. You have to be able to identify what is missing so you can a) go get it and try to fill in those gaps, or b) if you can’t find it, know the limits of your model and what conclusions you can or cannot make.
What impact does COVID have on those pursuing a career in data analytics?
In some ways, the pandemic has brought out the best in analysts and analytics. The number of civic hackathons is going up, and it’s heartwarming to see people sharing their code and resources, from large firms to individuals who are just really passionate. People are sharing their insights in real time, and I personally have learned a great deal in a very short time span, and from people I might never have interacted with otherwise. I hope it encourages women, including women of color, to take up the data sciences. I hope it inspires people.
Want to learn about analytics? Capitol Tech offers bachelor’s, master’s, and doctoral degrees in analytics and data science. Many courses are available both on campus and online. To learn more about Capitol’s degree programs, contact [email protected].