
Kellog AI: Technical Documentation
Release 0.0.1

J.F. Koehler

Sep 15, 2019


Introduction and Python

1 Tools for AI

2 Readings

3 Introduction to R

4 Plotting with R

5 Machine Learning with R

6 Time Series

7 KNIME: Getting Started

8 Starting a Workflow

9 Customer Churn with KNIME

10 Data Visualization with KNIME

11 TensorFlow

12 Cloud Services: An Overview

13 Mathematical Models

14 Kaggle Competitions


CHAPTER 1

Tools for AI

There are many languages, new and old, capable of carrying out Machine Learning and Artificial Intelligence oriented tasks. In recent years, a few open source tools have come to dominate the space. This guide is meant to give you a high-level overview of some tools that are freely accessible and ready to plug and play with ML and AI algorithms.

1.1 Languages

According to the StackOverflow developer survey of 2018, Python is the dominant language for data scientists and machine learning specialists. Other widely used options include the R language, the SQL database query language, and a newer language that is also gaining momentum called Julia. (StackOverflow survey)

All of these languages are freely accessible, open source projects. Below, you will find links to each language's major documentation. We will discuss a tool for putting these languages to use next. Later, we focus on introducing the Python computing language and its Pandas library for working with data.

• Python official documentation: https://www.python.org/

• Julia language official documentation: https://julialang.org/

• R documentation: https://www.r-project.org/

• MySQL documentation: https://dev.mysql.com/doc/refman/8.0/en/
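
As a small preview of the Pandas library mentioned above, the sketch below loads a CSV file into a DataFrame and prints a quick summary. The file name sales.csv is only a placeholder; pandas itself ships with the Anaconda distribution discussed in the next section.

import pandas as pd

# Load a CSV file into a DataFrame (replace 'sales.csv' with your own data file)
df = pd.read_csv('sales.csv')

# Inspect the first few rows and basic summary statistics
print(df.head())
print(df.describe())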


1.2 Jupyter Notebooks


Jupyter notebooks (jupyter.org) are a powerful tool for interacting with many different languages. The name indicates the connection to Julia, Python, and R. They offer an interactive, web-based interface to many languages, including the initial three. You can download the notebooks freely through the Anaconda distribution (here). The notebooks run locally in your web browser once installed.

Also, a few companies have begun to offer Jupyter notebooks directly through a web browser. We will examine a few options in the following section. The notebooks are a wonderful tool for teams and for communicating and sharing results with stakeholders.

• Videos from Software Carpentry (Carpentries Site) on installing and getting started with Jupyter notebooks:

• Mac: https://www.youtube.com/watch?v=TcSAln46u9U

• Windows: https://www.youtube.com/watch?v=xxQ0mzZ8UvA

• Introduction to Jupyter notebooks tutorial from Real Python: https://realpython.com/jupyter-notebook-introduction/

1.2.1 Jupyter Notebooks on the Web

Google Colab


Both Google and Microsoft have recently opened up versions of the Jupyter notebooks for use online. For Google, the notebooks are integrated into your Google Drive and are accessible at https://colab.research.google.com. If you have a Google login, you can use it to log in and save your notebooks to your Google Drive. More importantly, you can use Google's computational resources to tap GPU resources.

Colab has a number of tutorials available for getting up and running with Machine Learning in the cloud:

• Getting Started with Colab: https://colab.research.google.com/notebooks/basic_features_overview.ipynb

• Loading Data in Colab: https://colab.research.google.com/notebooks/io.ipynb

• Introduction to Pandas: https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb
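
If you switch Colab to a GPU runtime (Runtime > Change runtime type), a quick sanity check that the accelerator is actually visible is the short snippet below; this is a minimal sketch that assumes TensorFlow is installed in the runtime, as it is on Colab by default.

import tensorflow as tf

# Returns a device string such as '/device:GPU:0' when a GPU is attached,
# or an empty string when the runtime is CPU-only
device_name = tf.test.gpu_device_name()
print('GPU device:', device_name if device_name else 'none found')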

Microsoft Azure Notebooks

Microsoft's Azure Notebooks are similar to Google's offering. They are accessible through a web browser and can be configured to access additional processing power. They are accessible at https://notebooks.azure.com/.


Like Google, Microsoft has a number of tutorials to get up and running with the notebooks.

• Accessing Data with Azure: https://notebooks.azure.com/Microsoft/projects/2018-data-access

• Introduction to Python for Data Analysis: https://notebooks.azure.com/wesm/projects/python-for-data-analysis

1.3 Software for Local Computers

UNIX on Mac


When working locally, we will frequently want to interact with the file system of the machine to create, delete, move, and copy files. Typically, we will use a terminal application to execute this code. On a Mac, you have a terminal application already installed; you can find it by searching for the Terminal application in the search bar.

Git Bash on Windows


On a Windows machine, certain commands can be executed in PowerShell, but it is easier to download and install a different application that provides a UNIX-style shell. Git Bash is a common choice for Windows users.

WARNING: Be sure to choose the "Add to PATH" box during the installation process so you can access your other programs, including Python and Jupyter notebooks.

Resources for Learning Bash

• Software Carpentry lessons on UNIX shell: http://swcarpentry.github.io/shell-novice/


CHAPTER 2

Readings

Computing Machinery and Intelligence (readings/turing_computing_machinery.pdf) by Alan Turing

A Logical Calculus of the Ideas Immanent in Nervous Activity (readings/mcculloch_pitts_1943.pdf) by Warren McCulloch and Walter Pitts

The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain (readings/rosenblatt_perceptron.pdf) by Frank Rosenblatt


CHAPTER 3

Introduction to R


CHAPTER 4

Plotting with R


CHAPTER 5

Machine Learning with R


CHAPTER 6

Time Series


CHAPTER 7

KNIME: Getting Started

This week we focus on using the KNIME platform to implement a few basic data analysis workflows. KNIME is an open source analytics platform that provides a graphical interface and integrates with many other open source data analytics tools. From the creators of KNIME:

Our KNIME Analytics Platform is the leading open solution for data-driven innovation, designed for discovering the potential hidden in data, mining for fresh insights, or predicting new futures. Organizations can take their collaboration, productivity and performance to the next level with a robust range of commercial extensions to our open source platform.

To begin, we cover getting up and running with KNIME. Please visit https://www.knime.com/download to download and install KNIME, following the instructions in the video below.

[2]: from IPython.display import HTML
     HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/yeHblDxakLk?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

[2]: <IPython.core.display.HTML object>

7.1 Installing Extensions

The extensions contain many useful additions to the KNIME platform. For example, in the basic download we don't have access to integrations with Python scripts. If we install the extensions, we will be able to use the Python Scripts extension and include Python scripts in our workflows.

[3]: from IPython.display import HTML
     HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/8HMx3mjJXiw?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

[3]: <IPython.core.display.HTML object>


CHAPTER 8

Starting a Workflow

If you have installed KNIME Analytics Platform and you are wondering how to create your first workflow, this video will show you how to create a new workspace, a new workflow, or a new workflow group.

[3]: from IPython.display import HTML
     HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/-JtO7DW9Jr0?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

[3]: <IPython.core.display.HTML object>

8.1 Creating and Configuring Nodes

This video shows how to create, configure, execute, reset, and inspect the results of a node in KNIME Analytics Platform. It also offers an overview of the context menu and commands available for each node.

[4]: from IPython.display import HTML
     HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/fMM_w4v5zZc?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

[4]: <IPython.core.display.HTML object>


CHAPTER 9

Customer Churn with KNIME

This example is located in the EXAMPLES folder and the data is included. There are two data files, the Contract data and the CallData file. The video below walks through building the workflow:

[1]: from IPython.display import HTML
     HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/n8HbUUc51fc?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

[1]: <IPython.core.display.HTML object>


9.1 Credit Scoring with KNIME

This workflow can be found on the KNIME EXAMPLES Server under 50_Applications/02_Credit_Scoring/01_CreditScoring.

This KNIME workflow focuses on creating a credit scoring model based on historical data. As with all data mining modeling activities, it is unclear in advance which analytic method is most suitable. This workflow therefore uses three different methods simultaneously (Decision Trees, Neural Networks, and SVM), then automatically determines which model is most accurate and writes that model out for further use.

This workflow manipulates the data so it is suitable for a variety of modeling techniques by converting nominals to numerics. The data was enhanced so that understandable labels are used. It uses metanodes to "package" each technique so it is suitable for reuse. Each model uses a test/learn and cross-validated process to ensure accuracy. The workflow writes out the model in the official PMML format, so that other applications can use the model.


The data is German Credit data provided by

Professor Dr. Hans Hofmann

Institut für Statistik und Ökonometrie

Universität Hamburg

FB Wirtschaftswissenschaften

Von-Melle-Park 5 2000 Hamburg 13

Available at http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29
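
For readers who want to see the same "train several models, keep the winner" idea outside of KNIME, here is a rough scikit-learn sketch; it is an illustrative analogue, not the KNIME workflow itself, and the file name and label column are hypothetical placeholders for a pre-processed, numeric version of the credit data.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical pre-processed data: numeric features plus a binary 'creditable' label
data = pd.read_csv('german_credit_numeric.csv')
X, y = data.drop(columns='creditable'), data['creditable']

models = {
    'decision_tree': DecisionTreeClassifier(max_depth=5),
    'neural_network': make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000)),
    'svm': make_pipeline(StandardScaler(), SVC()),
}

# Cross-validate each candidate and keep the most accurate one,
# mirroring the model-selection step of the KNIME workflow
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores)
print('best model:', best)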


CHAPTER 10

Data Visualization with KNIME

This video introduces some basic uses of KNIME visualization nodes.

[2]: from IPython.display import HTML
     HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/aGGaIuloo0s?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

[2]: <IPython.core.display.HTML object>

10.1 Clustering and Visualization

KNIME Analytics Platform 3.7 offers some new JavaScript-based nodes. Use the Heatmap node to spot patterns, the Hierarchical Cluster Assigner node to create a dendrogram and explore clusters, and the Tile View node to visualize complex datasets including images and other graphics. All three nodes support custom CSS styling, which can be done with the CSS Editor node. Use nested wrapped metanodes to bring all created views together and the new Layout Editor to arrange them.

[4]: from IPython.display import HTML
     HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/P1cb0Wo7qx8?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

[4]: <IPython.core.display.HTML object>
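
The KNIME nodes above are point-and-click, but the same hierarchical-clustering-plus-dendrogram idea can be sketched in a notebook with SciPy; the example below runs on toy data and is an illustrative analogue rather than a KNIME workflow.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy data standing in for whatever table you would feed the KNIME nodes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Agglomerative clustering with Ward linkage, then cut the tree into two clusters
Z = linkage(X, method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')
print('cluster sizes:', np.bincount(labels)[1:])

# Dendrogram comparable to the view produced by the Hierarchical Cluster Assigner node
dendrogram(Z)
plt.show()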


CHAPTER 11

TensorFlow

This week, we introduce the TensorFlow ecosystem for tackling AI and deep learning problems. The notebooks below can be run locally or using the Google Colab links at the top of each notebook.

import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

11.1 Notebooks

• **Basic Classification**: This guide trains a neural network model to classify images of clothing, like sneakers and shirts. It's okay if you don't understand all the details; this is a fast-paced overview of a complete TensorFlow program with the details explained as we go.

• **Text Classification**: This notebook classifies movie reviews as positive or negative using the text of the review. This is an example of binary, or two-class, classification, an important and widely applicable kind of machine learning problem.


We'll use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.
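
For reference, this dataset can be pulled directly through tf.keras; the short snippet below only loads and inspects it (the num_words cap is an illustrative choice, not the notebook's exact setting).

import tensorflow as tf

# Download the IMDB reviews, keeping only the 10,000 most frequent words
(train_data, train_labels), (test_data, test_labels) = \
    tf.keras.datasets.imdb.load_data(num_words=10000)

print(len(train_data), 'training reviews,', len(test_data), 'test reviews')
print('first review (word indices):', train_data[0][:10])
print('first label (0 = negative, 1 = positive):', train_labels[0])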

• **Basic Regression**: In a regression problem, we aim to predict the output of a continuous value, like a price or a probability. Contrast this with a classification problem, where we aim to select a class from a list of classes (for example, deciding whether a picture contains an apple or an orange, and recognizing which fruit is in the picture).

This notebook uses the classic Auto MPG Dataset and builds a model to predict the fuel efficiency of late-1970s and early-1980s automobiles. To do this, we'll provide the model with a description of many automobiles from that time period. This description includes attributes like cylinders, displacement, horsepower, and weight.

• **Overfit and Underfit**: As always, the code in this example will use the tf.keras API, which you can learn more about in the TensorFlow Keras guide.

In both of the previous examples (classifying movie reviews and predicting fuel efficiency), we saw that the accuracy of our model on the validation data would peak after training for a number of epochs, and would then start decreasing.

In other words, our model would overfit to the training data. Learning how to deal with overfitting is important. Although it's often possible to achieve high accuracy on the training set, what we really want is to develop models that generalize well to testing data (or data they haven't seen before).

The opposite of overfitting is underfitting. Underfitting occurs when there is still room for improvement on the test data. This can happen for a number of reasons: the model is not powerful enough, is over-regularized, or has simply not been trained long enough. In these cases the network has not learned the relevant patterns in the training data.

If you train for too long, though, the model will start to overfit and learn patterns from the training data that don't generalize to the test data. We need to strike a balance. Understanding how to train for an appropriate number of epochs, as we'll explore below, is a useful skill.

To prevent overfitting, the best solution is to use more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization. These place constraints on the quantity and type of information your model can store. If a network can only afford to memorize a small number of patterns, the optimization process will force it to focus on the most prominent patterns, which have a better chance of generalizing well.

In this notebook, we'll explore two common regularization techniques, weight regularization and dropout, and use them to improve our IMDB movie review classification notebook.
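
As a concrete illustration of those two techniques, the fragment below adds L2 weight penalties and dropout layers to a small dense classifier; the layer sizes and penalty strength are placeholders rather than the notebook's exact architecture.

import tensorflow as tf

# Small dense classifier with L2 weight regularization and dropout between layers
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(16, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001),
                          input_shape=(10000,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(16, activation='relu',
                          kernel_regularizer=tf.keras.regularizers.l2(0.001)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()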

• **Saving Models**: Model progress can be saved during and after training. This means a model can resume where it left off and avoid long training times. Saving also means you can share your model and others can recreate your work. When publishing research models and techniques, most machine learning practitioners share:

• code to create the model, and

• the trained weights, or parameters, for the model

Sharing this data helps others understand how the model works and try it themselves with new data.
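
In Keras, saving and restoring amounts to a pair of calls. The sketch below uses a tiny stand-in model purely for illustration; any compiled Keras model works the same way, and the file name is arbitrary.

import tensorflow as tf

# Tiny stand-in model; any compiled Keras model can be saved the same way
model = tf.keras.models.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')

# Save the full model (architecture, weights, optimizer state) to one file,
# then reload it without retraining
model.save('my_model.h5')
restored = tf.keras.models.load_model('my_model.h5')
restored.summary()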

• **Keras Overview**: A complete overview of the Keras interface for the TensorFlow library.


CHAPTER 12

Cloud Services: An Overview

This week we aim to give an overview of a few popular cloud services that can be useful for both training and distributing AI models.

12.1 Google Cloud Platform

• **GCP Overview**: This overview is designed to help you understand the overall landscape of Google Cloud Platform (GCP). Here, you'll take a brief look at some of the commonly used features and get pointers to documentation that can help you go deeper. Knowing what's available and how the parts work together can help you make decisions about how to proceed. You'll also get pointers to some tutorials that you can use to try out GCP in various scenarios.

• **Best Practices for Enterprise Organizations**: This guide introduces best practices to help enterprise customers like you on your journey to Google Cloud Platform (GCP). The guide is not an exhaustive list of recommendations. Instead, its goal is to help enterprise architects and technology stakeholders understand the scope of activities and plan accordingly. Each section provides key actions and includes links for further reading.

12.2 Microsoft Azure Platform

• **Overview of Microsoft Cloud Operating Model**: This module covers the basics of how to use the Cloud Operating Model, a tool that helps you recognize where your organization is in its cloud adoption journey and identify next steps. It also provides tools to support you in the adoption of cloud computing. We'll describe how to identify opportunities and develop strategies based on real-world examples of cloud adoption.

• **Introduction to Azure Solutions**: Get started with your digital transformation by exploring why Azure is the right platform for your organization and the overall value Azure can bring. We'll demonstrate how digital transformation with Azure can empower employees, engage customers, optimize operations, and transform products.

• **Microsoft Azure AI strategy and solutions**: This module provides an overview of Azure AI and demonstrates how Microsoft tools, services, and infrastructure can help make AI real for your organization, whether you want to unlock insights from your latent data with knowledge mining, develop your own AI models with machine learning, or build immersive apps using AI.

12.3 Amazon Web Services

• **AWS Overview**: Amazon Web Services offers a broad set of global cloud-based products including compute, storage, databases, analytics, networking, mobile, developer tools, management tools, IoT, security, and enterprise applications: on-demand, available in seconds, with pay-as-you-go pricing. From data warehousing to deployment tools, directories to content delivery, over 140 AWS services are available. New services can be provisioned quickly, without the upfront capital expense. This allows enterprises, start-ups, small and medium-sized businesses, and customers in the public sector to access the building blocks they need to respond quickly to changing business requirements. This whitepaper provides you with an overview of the benefits of the AWS Cloud and introduces you to the services that make up the platform.

• **AWS Machine Learning Overview**: This course introduces Amazon Machine Learning and Artificial Intelligence tools that enable capabilities across frameworks and infrastructure, machine learning platforms, and API-driven services.

• **AWS Intro to AI**: In this course, we discuss what AI is and why it is important, take a brief look at machine learning and deep learning, which are subsets of AI, and describe how Amazon uses AI in its products.

• **Training Models in AWS**: In this tutorial, you will train a TensorFlow machine learning model on an Amazon EC2 instance using the AWS Deep Learning Containers.

• **AWS Use Case: Call Center Management**: This training introduces you to the practical Amazon approach to machine learning (ML).


CHAPTER 13

Mathematical Models

This week's resources focus on providing more detailed information about the mathematics behind many models we have encountered to this point. The materials are a mixture of readings and Jupyter notebooks, and are broken into standard conceptual blocks.

13.1 Supervised Learning

In supervised learning, we know the labels of some data that we would like to predict. Both classification and regression are examples of supervised learning problems.

• **Introduction to Supervised Learning**: Chapter from Introduction to Statistical Learning discussing the big picture of regression and classification problems, with accompanying R code.

• **Overview of Supervised Learning**: Chapter from Elements of Statistical Learning (ESL) on the big idea of supervised learning.

13.1.1 Linear Regression

• **PYDSHB: Linear Regression in Depth**: Chapter from the Python Data Science Handbook with accompanying Python code.

• **ISLR: Linear Regression Overview**: Introductory-level description of Linear Regression from the ISLR textbook. Includes relevant R code.

• **ELEMENTS: Linear Regression Deep**: Rigorous mathematical description of linear regression from the Elements of Statistical Learning.

• **Notes on Regression**: Notes from Andrew Ng’s class introducing Linear Regression.
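
To complement the readings above, here is a minimal, purely illustrative linear regression fit on synthetic data using scikit-learn; the coefficients recovered should be close to the slope and intercept used to generate the data.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
print('slope:', model.coef_[0], 'intercept:', model.intercept_)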


13.1.2 Classification

• **PYDSHB: Naive Bayes in Depth**: Overview of classification using Bayes Theorem from the Python Data Science Handbook. Jupyter notebook with accompanying Python code.

• **ISLR: Classification**: Introductory mathematical and statistical presentation of classification using Logistic Regression from the ISLR textbook. Contains relevant R code.

• **ESL: Classification**: Rigorous development of classification from Elements of Statistical Learning.

• **Notes on Classification**: Notes from Andrew Ng’s class introducing classification and Logistic Regression.
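
For a hands-on counterpart to those chapters, here is a minimal logistic regression fit on scikit-learn's built-in iris data; it is purely illustrative and not tied to any specific reading.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Built-in iris data: classify species from four flower measurements
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('test accuracy:', clf.score(X_test, y_test))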

13.2 Unsupervised Learning

• **PYDSHB: KMeans Clustering**: KMeans from the Python Data Science Handbook, including Python code and examples.

• **PYDSHB: Gaussian Mixture Models**: Gaussian Mixture Models in Python and examples from the Python Data Science Handbook.

• **ISLR: Unsupervised Learning**: Chapter from Introduction to Statistical Learning introducing unsupervised learning with accompanying R code.

• **ESL: Unsupervised Learning**: Rigorous introduction to unsupervised learning from the Elements of Statistical Learning.
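
And a minimal KMeans example on synthetic blobs, again just to ground the readings; the number of clusters is chosen to match how the toy data is generated.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three synthetic, well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print('cluster centers:')
print(kmeans.cluster_centers_)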


CHAPTER 14

Kaggle Competitions

Kaggle is a great way to learn and practice data science. Below you can find links to two contests to get you started as well as tutorials for helping improve your solutions in both Python and R.

14.1 Titanic: Learning from Disaster

Competition Description: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
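
A bare-bones Python starting point for this challenge might look like the sketch below; it assumes the competition's train.csv has been downloaded from Kaggle and deliberately uses only a few of the available columns.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# train.csv comes from the Kaggle competition page
train = pd.read_csv('train.csv')

# A deliberately simple feature set: sex, passenger class, and fare
X = pd.get_dummies(train[['Sex', 'Pclass', 'Fare']]).fillna(0)
y = train['Survived']

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print('cross-validated accuracy:', cross_val_score(clf, X, y, cv=5).mean())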

Tutorials:

• Titanic Data Science Solutions: The notebook walks us through a typical workflow for solving data science competitions at sites like Kaggle.

• Interactive Data Science Tutorial: Based on Titanic Kaggle competition.

• Machine Learning and Scikit-Learn: This notebook covers the basic Machine Learning process in Python step-by-step. Go from raw data to at least 78% accuracy on the Titanic Survivors dataset.

• XGBoost Example: Popular approach to improving classification models and responsible for many winning solutions.


14.2 House Prices: Advanced Regression Techniques

Competition Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
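
In Python (the tutorial kernels listed below are mostly in R), a minimal starting point might be a gradient-boosted model on the numeric columns; train.csv is the file from the Kaggle competition page and SalePrice is its target column.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# train.csv from the House Prices competition; SalePrice is the target column
train = pd.read_csv('train.csv')
X = train.select_dtypes(include='number').drop(columns='SalePrice').fillna(0)
y = np.log1p(train['SalePrice'])  # the competition scores on log prices

model = GradientBoostingRegressor(random_state=0)
print('cross-validated R^2:', cross_val_score(model, X, y, cv=5).mean())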

Tutorials:

• Fun with Real Estate Data: Linear, Random Forest, and XGBoost models in R for predicting sale prices.

• XGBoost and Parameter Tuning: Example in R of using XGBoost and hyperparameter tuning to predict the housing prices.

• A Study on Regression Models: This kernel is an attempt to use every trick in the books to unleash the full power of Linear Regression, including a lot of preprocessing and a look at several Regularization algorithms.
