So You Want to Do a Data Science Project, Now What? (Part III)

Remember that quote from the movie Moneyball? The Oakland A’s general manager Billy Beane (played by Brad Pitt) says, “I think the question we should be asking is…Do you believe in this thing or not?”

His assistant, Peter Brand (played by Jonah Hill), responds with confidence, “I do.” That is exactly how you want to feel when your data science project starts generating results!

In the first two articles of this series, we discussed the key knowledge areas and the important questions to ask to discover all you need to know about your potential data science project. We also looked at the “raw materials” you need to work with: all the data to be gathered from different places and then assembled and staged for conversion into information for decision support.

To finish the series, we provide a quick view into the components of a data science engine. Paying close attention at each step of the project will give you the same level of confidence as Peter Brand when your solution starts producing results.

The 6 Key Components of Successful Data Science Projects

When each of these six steps is completed successfully, your organization will have a higher level of trust in the results of your data science project:

  1. Identify the key business question(s) you need to answer.
  2. Organize the “rodeo” where you collect all the data.
  3. Turn the analytics loose for business users to consume.
  4. Make the data into nice “eye candy” so it’s easy to consume and gets adopted.
  5. Develop the proper organizational structure to digest and act upon the data.
  6. Allocate a budget for care, maintenance, and continuous improvement of the data science solution.

Taking these steps will ensure your analytics asset stays calibrated and groomed to deliver more and more value to your business over time.

How Do You Want to Make Your Business Smarter?

By taking the first step above, you avoid jumping from the forest of your data science project right into the trees. Be sure to review the primary question: What is the decision that you want to make smarter in your business? Everything else will flow from here.

Once you know the question you want to answer, it should point you to the second step: the likely suspects to round up for data sources. If the question concerns manufacturing efficiency or quality, for example, the data may live in production process control, QA, customer satisfaction, or standard cost data.

You’ll then need to consider the method for obtaining the data from all the necessary sources, converting it to the proper format, validating it, and staging it for consumption by statistical and analytics tools. Is the data on a physical server or enterprise database? Or is it stored on a closed application platform (mainframe) or in the cloud in a proprietary database?
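To make the obtain, convert, validate, and stage sequence concrete, here is a minimal Python sketch. It assumes two made-up extracts (a QA file in CSV and a process-control feed in JSON) for the manufacturing quality example; the field names, sample values, and function names are all hypothetical and only illustrate the shape of the work.

```python
import csv
import io
import json

# Hypothetical raw extracts from two sources. In practice these would come
# from a database export, an API call, or a file drop.
QA_CSV = "batch_id,defects\nB1,3\nB2,not recorded\nB3,0\n"
PROCESS_JSON = '[{"batch": "B1", "units": 500}, {"batch": "B3", "units": 450}]'


def load_qa(text):
    """Convert the CSV extract to {batch: defects}, rejecting invalid rows."""
    rows, rejected = {}, []
    for row in csv.DictReader(io.StringIO(text)):
        if row["defects"].isdigit():  # validate: defects must be a whole number
            rows[row["batch_id"]] = int(row["defects"])
        else:
            rejected.append(row["batch_id"])
    return rows, rejected


def load_process(text):
    """Convert the JSON extract to {batch: units produced}."""
    return {rec["batch"]: rec["units"] for rec in json.loads(text)}


def stage(qa, process):
    """Join both sources on batch ID into one analysis-ready table."""
    return [
        {"batch": b, "units": process[b], "defects": qa[b],
         "defect_rate": qa[b] / process[b]}
        for b in sorted(qa.keys() & process.keys())
    ]


qa_rows, rejected_batches = load_qa(QA_CSV)
staged = stage(qa_rows, load_process(PROCESS_JSON))
```

Rows that fail validation are set aside in rejected_batches rather than silently dropped, so someone can follow up with the owner of that source.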

Look Behind the Eye Candy

In step four, some people get excited about the eye candy and analytics, but most will admit that the trickiest part of a data science project isn’t what happens in the data lake. It’s what happens in the data “canal locks” that flow into the data lake. There are likely to be multiple data sources feeding the aggregation of data from which information is refined, each with its own API or infrastructural requirement.

Then there are issues of who owns the data and whether there is a cost to obtain it. The cost may be charged per query, per MB, or per month. Also consider the cost of the transient storage for the data lake (server or cloud) and any security measures that need to be employed for compliance or privacy requirements.
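A quick back-of-the-envelope comparison can show which pricing model suits your expected usage. The sketch below is illustrative only: the workload (2,000 queries a month at 5 MB each) and all three price points are invented numbers, not quotes from any vendor.

```python
def monthly_cost(queries, mb_per_query, *, per_query=0.0, per_mb=0.0, flat=0.0):
    """Estimated monthly spend for one data source under a simple pricing plan."""
    return flat + queries * per_query + queries * mb_per_query * per_mb


# Hypothetical workload and prices, chosen only to illustrate the arithmetic.
QUERIES, MB_PER_QUERY = 2000, 5
plans = {
    "per query": monthly_cost(QUERIES, MB_PER_QUERY, per_query=0.25),
    "per MB": monthly_cost(QUERIES, MB_PER_QUERY, per_mb=0.04),
    "flat monthly": monthly_cost(QUERIES, MB_PER_QUERY, flat=450.0),
}
cheapest = min(plans, key=plans.get)  # plan name with the lowest monthly cost
```

Rerunning the comparison with a higher or lower query volume shows how quickly the cheapest option can change as usage grows.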

Next, the question becomes, how often do you need to make the decision, and what is the acceptable time to deliver an answer once the question is asked? If your end-users want to ask the question at any time of day and don’t want to wait, you will need to cache the data to avoid query latency.

This means storing the data and writing policies about how frequently it should be refreshed to maintain validity. If the question your data science project is making smarter is already part of an existing BI dashboard and latency is less important, then it can be a scheduled query that feeds a dashboard or a report.
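A refresh policy of this kind can be sketched in a few lines of Python. This is one simple approach (a time-based cache) under stated assumptions, not a recommendation for any particular product: the class name, the one-hour limit, and the slow_query stand-in are all hypothetical.

```python
import time


class RefreshingCache:
    """Serve a stored result until it is older than max_age_seconds."""

    def __init__(self, fetch, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch              # slow call to the underlying source
        self._max_age = max_age_seconds  # refresh policy: maximum data age
        self._clock = clock              # injectable so the policy can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._max_age
        if stale:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


calls = []


def slow_query():
    """Stand-in for an expensive query against the data lake."""
    calls.append(1)
    return {"on_time_rate": 0.97}


fake_now = [0.0]  # simulated clock so the example runs instantly
cache = RefreshingCache(slow_query, max_age_seconds=3600, clock=lambda: fake_now[0])
first = cache.get()   # goes to the source
second = cache.get()  # served from the cache, no waiting
fake_now[0] = 3600.0
third = cache.get()   # an hour later, the data is refreshed
```

The end-user only waits on the first request and on each refresh; every other request returns immediately, at the price of data that may be up to an hour old.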

When Refining Data, Don’t Fear the Python!

Depending on the types of information to be gleaned from the data, there are different analytics tools and machine learning algorithms to consider. Each has its own requirements and calls for its own API skills. Some are open source and some are pay-as-you-go.

You will also start hearing about people with expertise in Python (have no fear, it is a programming language, not a reptile!). Find someone well-versed in the application of these tools and techniques to refine the type of data you have into decision-support information that stays calibrated over time.

This is different from database analysis skills, which involve structuring the data. It’s more about the mathematical techniques required to find rhythms in the chaos. For your IT people, this means an atypically heavy computational workload that has to be run against the assembled data. The output then needs to be delivered to some sort of data product (graphics or text) that assists in making the targeted decision smarter.

The data product could be static (a fixed Excel graph) or interactive. Active graphs let the decision support team drill down from a high-level summary to focus on one particular pre-selected variable, such as geography, time slot (month/quarter), customer type, or product line. Users can also slice and dice the summary by changing multiple variables at once, such as looking at all customers of a certain age or living in a certain zip code, which also helps in tuning the analytics. These capabilities highlight a particular aspect of the information and what is driving the decision, one way or another.
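Under the hood, drilling down and slicing and dicing both come down to filtering records and regrouping the totals. Here is a minimal Python sketch of that idea; the sales records and field names are made up for illustration, and a real data product would typically sit on a BI tool or analytics library rather than hand-written code.

```python
from collections import defaultdict

# Hypothetical sales records; field names are illustrative only.
SALES = [
    {"region": "East", "quarter": "Q1", "age": 34, "revenue": 120.0},
    {"region": "East", "quarter": "Q2", "age": 52, "revenue": 80.0},
    {"region": "West", "quarter": "Q1", "age": 41, "revenue": 200.0},
    {"region": "West", "quarter": "Q2", "age": 29, "revenue": 50.0},
]


def summarize(records, by, where=lambda r: True):
    """Total revenue grouped by one field, over records passing the filter."""
    totals = defaultdict(float)
    for r in records:
        if where(r):
            totals[r[by]] += r["revenue"]
    return dict(totals)


# High-level summary: revenue by geography.
by_region = summarize(SALES, "region")
# Drill down: inside the East region, break revenue out by quarter.
east_by_quarter = summarize(SALES, "quarter", where=lambda r: r["region"] == "East")
# Slice and dice: only customers under 45, still grouped by region.
under_45 = summarize(SALES, "region", where=lambda r: r["age"] < 45)
```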

Growing a 300-Pound Gorilla: Data Science in Action

One example of a successful data science project is a decision-support tool for a capital investment firm. The tool compares and contrasts industry verticals that don’t have a 300-pound gorilla company controlling the market.

The firm wanted to identify a potential company in these verticals as an acquisition integration platform and grow that gorilla. Not only did we need to acquire data (which existed in several different places in several different formats), we also had to normalize the data into an apples-to-apples format in order to make the financial comparative models work.
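To give a feel for what “apples-to-apples” normalization involves, here is a small Python sketch. It is not the code from the project described here: the two company records, their reporting units, and the field names are invented, and EBITDA margin is used simply as one familiar comparable metric.

```python
# Hypothetical filings: each source reports in a different unit and uses a
# different name for the top line.
RAW = [
    {"company": "A", "unit": "thousands", "revenue": 52_000, "ebitda": 7_800},
    {"company": "B", "unit": "millions", "sales": 31.0, "ebitda": 6.2},
]
UNIT_SCALE = {"thousands": 1_000, "millions": 1_000_000}


def normalize(rec):
    """Rescale to whole currency units and derive a size-independent metric."""
    scale = UNIT_SCALE[rec["unit"]]
    revenue = rec.get("revenue", rec.get("sales")) * scale  # unify field names
    ebitda = rec["ebitda"] * scale
    return {
        "company": rec["company"],
        "revenue": revenue,
        "ebitda": ebitda,
        "ebitda_margin": ebitda / revenue,  # comparable across company sizes
    }


normalized = [normalize(r) for r in RAW]
```

Once every company is expressed in the same units and the same fields, ratios such as margin can be compared directly even when the companies differ greatly in size.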

The Moneyball example referenced above was easier by comparison: the player statistics had high fidelity. For the investment firm, we had to create a set of metrics on which to slice and dice the candidate gorilla companies’ financial reports. Even once the models were running, we had to decide which metrics were truly predictive and how to introduce this information into the firm’s investment decision workflow, so they could absorb, discuss, and evaluate options in a collaborative manner.

Even after the business sectors were identified for potential investment, there was a good deal of guidance that came from the team’s own personal network of connections, which allowed them to choose paths of least resistance from a roster of good-quality investment suggestions. This was key because, in the end, data science can’t pick up the phone and close the deal. People are required, so we needed to arm them with smarter options.

This effort clearly paid off: end-users have repeatedly identified new verticals as well as solid platforms for acquisitive growth. The tool has evolved somewhat too, now including more automated data governance.

After You Generate the Analysis

For your data science project, getting to the point where you generate the required information to make smarter decisions is obviously critical. But there also has to be an organizational process and cadence that synchronizes with the production of the data (step five). Only then can the organization (or customers) make the smarter decision at the desired time or perhaps in real time.

If the consumer is a client, the data and its impact upon their decisions need to be clear. If it is an internal business process (such as reports for senior management), the consumers of the data have to be properly trained in how to interpret and act upon the information. There has to be someone responsible for supporting the decision-makers!

Behind all of this, you need people to husband the data, making sure it stays pure, valid and fresh as part of maintenance (step six). Furthermore, as the system is used, there will be continuous improvement to drive deeper thinking and better inferences, which will define your company’s competitive edge going forward.

Be watchful—you might be surprised at the new things that the numbers will whisper to you. And sometimes, there are amazing positive and unintended consequences!

I hope you have gained some helpful insights from this series covering the key elements that go into a data science project. I welcome the opportunity to hear how your data science project is going and would be glad to assist you in finding the answers to any questions you have. Feel free to reach out to me at