So You Want to Do a Data Science Project, Now What? (Part II)

This article follows So You Want to Do a Data Science Project, Now What? (Part I), where we covered the key knowledge areas and the important questions to ask to discover all you need to know about your potential data science project. We now look at the “raw materials” you need to work with—all the data to be gathered (sometimes from different places), assembled, and staged for conversion into information for decision support.

Bringing all the raw materials together for a data science project reminds me of watching my Mom plan a big holiday dinner. She knew all the phases, where to get the necessary ingredients, the processes of converting the ingredients into different dishes, how to stage it so the appropriate things occurred in the right order—and so there were leftovers at the end for meals yet unplanned.

Comparing my Mom’s approach to a data science project is not so far-fetched. The comparison can help you create a high-level framework for the work to be done that you can “feed” to the team who will take the project on, whether that team is internal, external, or a hybrid combination.

Prep the Data So It “Cooks” Just Right

The data will likely need to be prepped, then placed together into one pot (aka a data lake) in a particular way and “cooked” just right. This is the first modular component of a data science project. You gather the required data from all the disparate sources, then format it properly—sometimes using Extract-Transform-Load (ETL) technologies—so it can be processed.
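To make the gather-and-format step concrete, here is a minimal sketch of an extract-transform-load pass in Python. It is an illustration only: the two sample sources, the field names (customer_id, order_total), and the in-memory list standing in for the data lake are all assumptions, not part of any particular ETL product.

```python
import csv
import io
import json

# Hypothetical raw inputs standing in for two disparate sources.
CSV_SOURCE = "customer_id,order_total\n101,25.50\n102,40.00\n"
JSON_SOURCE = '[{"customerId": 103, "total": "12.25"}]'


def extract_csv(text):
    """Extract: read rows from a CSV export as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: read records from a JSON export."""
    return json.loads(text)


def transform(record, source):
    """Transform: map each source's field names onto one common schema."""
    if source == "csv":
        customer_id, total = record["customer_id"], record["order_total"]
    else:
        customer_id, total = record["customerId"], record["total"]
    return {
        "customer_id": int(customer_id),
        "order_total": float(total),
        "source": source,  # keep track of where each record came from
    }


def load(records, lake):
    """Load: append the uniform records to the shared store."""
    lake.extend(records)
    return lake


def run_etl():
    lake = []  # in-memory stand-in for a data lake (assumption)
    load([transform(r, "csv") for r in extract_csv(CSV_SOURCE)], lake)
    load([transform(r, "json") for r in extract_json(JSON_SOURCE)], lake)
    return lake


staged = run_etl()
```

After the run, every record in staged has the same three fields regardless of which source it came from, which is the property a downstream analytics tool would rely on.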

The second piece, which is the analytics technique, will have some impact on the first part. That’s because you might select an off-the-shelf analytics tool that expects data to be in a particular format and organization. But even if the analytics tool is built completely from the ground up, the team will still need to coordinate how things are gathered and assembled in the data lake so the analytics tool can function in a repeatable way.

Furthermore, this second piece of the process requires that the data be checked for validity. You don’t want to cook with bad or out-of-date ingredients!
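One way to picture that check is a small routine that looks at each record for three things: is it complete, is it valid, and is it fresh? The sketch below assumes a hypothetical record layout (customer_id, order_total, updated_on), a hypothetical 30-day freshness window, and a fixed "today" so the result is repeatable. Real rules would come from the project's own requirements.

```python
from datetime import date, timedelta

# Hypothetical schema and freshness window, chosen for illustration only.
REQUIRED_FIELDS = ("customer_id", "order_total", "updated_on")
MAX_AGE = timedelta(days=30)


def check_record(record, today):
    """Return the problems found in one record; an empty list means it passed."""
    problems = []
    missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
    if missing:
        problems.append("incomplete: missing " + ", ".join(missing))
    total = record.get("order_total")
    if total is not None and total < 0:
        problems.append("invalid: negative order_total")
    updated = record.get("updated_on")
    if updated is not None and today - updated > MAX_AGE:
        problems.append("stale: older than freshness window")
    return problems


def split_records(records, today):
    """Separate usable records from rejected ones, keeping the reasons."""
    good, rejected = [], []
    for record in records:
        problems = check_record(record, today)
        if problems:
            rejected.append((record, problems))
        else:
            good.append(record)
    return good, rejected


# Hypothetical sample data with a fixed reference date.
TODAY = date(2024, 1, 31)
SAMPLE = [
    {"customer_id": 101, "order_total": 25.5, "updated_on": date(2024, 1, 20)},
    {"customer_id": 102, "order_total": 40.0, "updated_on": date(2023, 10, 1)},
    {"customer_id": 103, "order_total": None, "updated_on": date(2024, 1, 25)},
]

good, rejected = split_records(SAMPLE, TODAY)
```

Keeping the rejected records together with their reasons, rather than silently dropping them, gives the team something to discuss when a source starts delivering bad ingredients.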

Make the Data Easy to Consume

Once the information is gleaned from all the raw materials, it needs to be served up in a consistent way so that it is easily understood and consumed by decision-makers. This can come in the form of a static data product (graphs) or integration into a BI dashboard that allows slicing and dicing of the data to hunt down support for specific decisions.

Next, there has to be an augmented business process, in which decision-makers sit at the table with their “servings” of decision support and collaboratively make smarter business decisions based on the information. This means changing the way things are done. There must be a new process that includes the new information so better decisions can be made in a specific business-relevant way.

Finally, there should be discussions about the efficacy of the data science tool. This includes seeing the value it brings to the business and using principles of continuous improvement to keep the good results moving forward and ahead of the competition. At this stage, there should also be clear connections between the data science project and the achievement of specific business outcomes.

The Magic of Centralizing Raw Data

A large medical system provides an example of the magic that can happen when raw data from multiple sources is brought together. In this case, centralizing information was an enabler for data science, and the sharing of information and insights into existing medical workflows made the overall organization smarter.

Starting out, the organization wanted to update its electronic medical information system. As part of that update, there was some significant work done to acquire and store results from a broad array of hospital medical devices into one central location so practitioners could quickly access electronic health records.

This was initially an attempt to lower medical costs and eliminate paperwork, both of which were achieved. In addition, because all the data was nicely organized in a central location, it also enabled research into anonymizing patient information to perform data analysis on aggregate vitals, diagnosis codes, imagery, treatments, and results.

This was the first time the organization could realize this goal because the information was no longer scattered across several different medical systems. By sharing the information with medical practitioners—rather than only hospital management—not only was there a harmonization of treatments, but also an overall improvement in treatment efficacy as well as a reduction in hospital stay time.

Unintended positive consequences occurred because the information was integrated into several professional workflows, allowing shared knowledge to drive smarter decisions. This effort has now evolved to help doctors achieve better insights toward patient treatment options while also protecting individual patient privacy.

Drive Requirements by Following Project Stages in the Correct Order

To meet the specific requirements of your data science project, you will need to have conversations with your data science team (regardless of where the team resides) during each of the project stages. The requirements should take things from a high-level design to the detailed design elements to make sure each building block is taken care of—and in the correct order within the overall flow of the project.

Here are the steps to follow…

  1. Have a clear image of what “a data science win” looks like; ask, “What is it I’m trying to do smarter and how does that improve our business?”
  2. Gather, prepare and stage the data.
  3. Make sure the data is complete, valid and fresh.
  4. Convert the data into information with the appropriate analytics technique(s).
  5. Deliver the information in a properly digestible data product (graphics/text, slicer/dicer) that supports the decision process.
  6. Implement a new process in which the prepared information is consumed and acted upon in the proper cadence, either individually or in collaboration.
  7. Continuously validate the data science process and its ability to attain the business objectives.

Following these steps in order is critical: You don’t want to serve the dessert before the main course!

Knowledge of the overall project and how the pieces interact will also ensure you get the most out of the team, particularly if there are multiple teams working on one overall project.

In our next article, So You Want to Do a Data Science Project, Now What? (Part III), we go more deeply into each phase so that you will feel prepared for the conversations with the project team leaders responsible for each of these pieces. In the meantime, I also welcome the opportunity to hear how your data science project is going and would be glad to assist you in finding the answers to any questions you have. Feel free to reach out to me at henry@tiempodev.com.