Machine Learning/Data Science

Document created by gjsissons on May 11, 2017. Last modified by Chris Jennings on Aug 15, 2017.




Machine learning and large-scale data collection are not "new" technologies, but machine learning capabilities are now reaching developers and businesses at every level, and the ability to collect and store data is unprecedented.  Together they make an explosive peanut butter and chocolate combination.  United, the two are bringing technologies such as voice-first interfaces, chat bots, robotics, virtual and augmented reality, and fraud detection to developers of all abilities and industries.  Parsing the spoken word or running fraud analytics on payment transactions is becoming nearly as approachable as parsing a simple comma-delimited text file.  Over the last few years, machine learning and data collection have formed the foundation of these trends, and additional layers like TensorFlow, Watson Analytics, and Azure Machine Learning now abstract the technical details even further.  The time is ripe for anyone, of any background, to leverage machine learning to add business value from data.


Use cases

The use cases for machine learning are unlimited.  All it takes is a bit of data, some creativity, an algorithm, and some daring.

  • Fraud Analytics
  • Customer Spend Analysis (why certain customers tend to spend more than others)
  • Image Analysis
  • Sentiment Analysis


"Our Project"

The most excellent aspect of the world of computers these days is that the opportunity to learn is almost always a Google search away (which, by the way, is also an excellent use of machine learning).  For our project we are going to leverage Kaggle, a learning resource and all-around cool place for data scientists to hang out.  Visit the site here:  Kaggle: Your Home for Data Science


Even better, our friends at Kaggle host freely available datasets that let the entire community learn and participate in data science and machine learning, including people who would not otherwise have access to such data.  If that's not enough, many community members apply machine learning algorithms to these datasets and publish their work so that others can reproduce it.  We will attempt to reproduce some of those results using a dataset near and dear to our hearts:  credit card fraud.


The fascinating thing about this dataset is that it's publicly available.  Granted, it does not expose the underlying raw data, but we'll do some inference based on industry knowledge and take some best guesses at what the underlying data might contain.  Along the way we'll learn about R (a language for statistical analysis) and some of the fun things you can do with R, machine learning, and credit card data to help prevent fraud.


First, let's take a look at the data.  Browse on over to the dataset's page on Kaggle and download the data.


I have highlighted in yellow the link to click to obtain the data.  While you're on the Kaggle site, read a bit about the data and its format.  The gist is that, due to confidentiality, the features have been transformed with PCA (principal component analysis), so unfortunately we do not get to see masked card data, merchant-specific information, or other annotations.  Still, we can make an informed guess that some of those features were likely in the underlying data.  Another fascinating property of this data, and of credit card fraud in general, is that the data is highly unbalanced.  A simple 'predict all transactions as not fraud' algorithm will score very high in accuracy, but obviously the results would be useless, since we're trying to predict fraud!
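
The accuracy trap described above can be sketched in a few lines of Python.  The counts below are made up for illustration (they are not taken from the Kaggle dataset), and the label convention of 1 for fraud and 0 for legitimate is an assumption.

```python
# Illustrative only: hypothetical counts, not the real dataset's.
# Assumed label convention: 1 = fraud, 0 = legitimate.
n_legit, n_fraud = 9980, 20
labels = [0] * n_legit + [1] * n_fraud

# The "predict all transactions as not fraud" baseline.
predictions = [0] * len(labels)

# Accuracy: the share of all transactions labeled correctly.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Recall: the share of actual fraud cases that the baseline caught.
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = caught / n_fraud

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.998
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

The baseline looks nearly perfect on accuracy while catching no fraud at all, which is why a metric such as recall tends to be more informative than plain accuracy on data like this.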


The next special attribute of these datasets is that anyone can implement a Kaggle Kernel and show inline how they analyzed the dataset.  Other users can then easily fork that Kernel or comment directly on it.  Basically, this is an excellent way to study other people's analysis and then try to reproduce and understand it yourself, especially if you are not a classically trained data scientist with a large statistical/machine learning background.


Here's a snapshot of a Kernel written in a Python notebook analyzing the credit card fraud data:



Here is another notebook with analysis in R



Now that you have the data downloaded to your computer, use your favorite statistical analysis package to try your hand at finding fraud in the dataset.  Then upload your analysis to Kaggle so that everyone can learn more about preventing fraud.
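
As a first step before any modeling, it can help to count how many rows fall into each class.  The sketch below uses only the Python standard library.  The file name `creditcard.csv` and the label column name `Class` (with 1 marking fraud) are assumptions, so check them against the file you downloaded.  `count_classes` is a hypothetical helper, and the tiny in-memory sample is made up to stand in for the real file.

```python
import csv
import io
from collections import Counter


def count_classes(csv_file, label_column="Class"):
    """Count rows per label value in an open CSV file with a header row.

    `label_column` is an assumed column name; adjust it to match your data.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Tiny made-up sample standing in for the downloaded file.
sample = io.StringIO(
    "V1,V2,Amount,Class\n"
    "0.1,-1.2,12.50,0\n"
    "1.3,0.4,3.99,0\n"
    "-2.0,2.2,250.00,1\n"
)
counts = count_classes(sample)
total = sum(counts.values())
for label, n in sorted(counts.items()):
    print(f"class {label}: {n} rows ({n / total:.1%})")

# With the real file (assumed name), the call would look like:
# with open("creditcard.csv", newline="") as f:
#     print(count_classes(f))
```

Seeing the class proportions up front makes it easier to judge later results, since any model has to beat the trivial baseline of always predicting the majority class.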


Sharing Private Data on the Horizon?

After performing your analysis, contemplate the issues involved in providing this data set to the public.  Being a good data steward means considering privacy implications when sharing data or posting it in a public forum.  Unfortunately, it is extremely difficult to share this type of information publicly while still retaining the meaning behind the data set.  We can play with algorithms and predictions, but from an operational perspective it is difficult to know whether we have truly accomplished anything without knowing the underlying data structure (e.g. what did the model actually tell us about fraudulent transactions?).  In some cases we do not necessarily care what the algorithms are doing as long as they move the needle, but treating these systems as black boxes can cause unintended consequences.


There are simulators mentioned in the links below that attempt to generate accurate data for mining, but it was surprising that there were not more of them.  This might be an excellent area for improvement, not only for payment data but for any data where the privacy implications make sharing difficult.  One company leverages encryption techniques to allow sharing private data.  Granted, it still has not cracked the nut that would allow its end users (other data scientists) to truly understand the underlying data.  However, it has taken one step in the correct direction by providing a scoring metric and a compensation system for algorithms that advance the company's goals in the financial markets.  While the data scientists do not know what the underlying data is, they do know whether their algorithms contributed successfully, and they can potentially understand the 'why'.


It is a fascinating world we live in, and hopefully the machine learning and data mining wave is one that we will all surf successfully in the coming years.

Credit Card Fraud Detection Resources



Machine Learning/Data Science Resources


Legal Stuff/References

  • Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015