Make Smart Predictions with Amazon Machine Learning

The Temboo Choreo library just got a new addition: Amazon’s Machine Learning service. It’s an excellent way to get started with data-driven predictions in any application without bringing on a Machine Learning specialist. If you’re looking for a straightforward supervised Machine Learning solution and your application doesn’t call for a custom implementation, Amazon Machine Learning may be just the tool you’re looking for. It’s a version of one of the Machine Learning implementations that Amazon itself uses internally, so scalability is certainly a feature.

In this article, we’ll give you a crash course in the basic concepts of Machine Learning, tell you where to start with Amazon Machine Learning and its API, and give you a few pointers on how you might approach applying Machine Learning to your Internet of Things application.

Amazon Machine Learning is designed for supervised Machine Learning tasks. Not so sure what we mean by that? No worries, just read on.

Putting Machine Learning into Context

Here’s an example of just one of the countless tasks that can be done with Machine Learning:

Let’s say you make your living selling houses and you want to make pricing the houses you sell an easier task. You might decide to train a Machine Learning model to predict the price at which you should sell the house. To do so, you collect all the data you can find about houses sold in your area in the last five years or so.

Your data set includes several data points describing each house’s features, like its area in square feet and the number of bedrooms it has. For each house, you also include the price at which it was sold. You make sure your data is clean and ready for processing, and you feed it into your Machine Learning model to “teach” it how to think about house prices. Then you test the model on data for other houses whose sale prices you already know. How closely its guesses match those actual sale prices tells you how representative your initial data set was, and thus how helpful your Machine Learning model will be.

If your model is effective, you can give it data for a house you’re hoping to sell, and then get a good prediction for the price you should sell it for. What you do with that prediction is up to you.

Machine Learning Fundamentals

The underlying basis of a Machine Learning application is a statistical model that improves itself as it is given more data. In his book Deep Learning with Python, Google AI researcher François Chollet describes the distinction between classical programming and Machine Learning as follows: classical programming uses rules and data to produce answers, while Machine Learning uses data and answers to produce rules.
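To make that distinction concrete, here’s a minimal Python sketch. The data points, the hand-written pricing rule, and the function names are all made up for illustration, and the “model” is just a least-squares line through three points, which is far simpler than anything a real Machine Learning service would train.

```python
def fit_line(examples):
    """Least-squares fit of price = slope * area + intercept."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    var = sum((x - mean_x) ** 2 for x, _ in examples)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# Classical programming: a rule plus data produces an answer.
def price_by_rule(area_sqft):
    return 100 * area_sqft + 20000  # rule chosen by the programmer (hypothetical)


# Machine Learning: data plus answers produce the rule.
examples = [(1000, 120000), (1500, 170000), (2000, 220000)]  # made-up (area, price) pairs
slope, intercept = fit_line(examples)


def price_by_model(area_sqft):
    return slope * area_sqft + intercept


print(price_by_rule(1800), price_by_model(1800))
```

Because the made-up examples were generated from the same rule, the fitted line recovers it. With real sales data the learned rule would only approximate the underlying relationship.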

It’s important to understand that not every problem is a good candidate for Machine Learning. Many of the questions you may have for your data set are perfectly answerable using classic analytical methods. So when is Machine Learning a good choice to solve the task at hand? Amazon Machine Learning’s documentation helpfully identifies the following two simple rules for deciding whether ML is the right choice:

If you can’t explicitly program the behavior you want because it’s unclear exactly what the logical rules to produce that behavior are, as lots of variables influence the outcome
If it’s unreasonable to perform a task at scale even if it’s simple enough to perform on an individual basis

Types of Machine Learning

The major types of Machine Learning are characterized by the kind of feedback given to the Machine Learning algorithm during its “learning” phase. Though Amazon Machine Learning is for performing supervised Machine Learning tasks, it is useful to have a basic understanding of the capabilities of each of the following major types:

Unsupervised
This kind of machine learning model finds the patterns in the data even though it is given no human help in interpreting the data.
In Context: An example of an unsupervised ML task is “clustering”, in which the ML model determines which items in the data sets are related to each other. An example of clustering might be grouping similar news articles together without any humans needing to read them and tag them with metadata.
Reinforcement
This kind of machine learning makes attempts at solving a problem or making a prediction and receives feedback on each attempt, typically a reward or penalty signal from its environment rather than a labeled correct answer.
In Context: Reinforcement learning is used for many things, including teaching an ML model how to successfully play a video game, or teaching a robot hand how to pick something up.
Supervised
This kind of machine learning is given a large volume of complete sets of data, after which it is asked to try to predict the value of a missing data point inside an incomplete set of data. This is what we can use Amazon Machine Learning to do.
In Context: Our house pricing example given at the beginning of the article is a supervised Machine Learning task. Other examples include an ML model trained to detect whether a given email is or isn’t spam. The training data for a model to perform such a task would be made up of many examples of emails that a human has identified and labeled as being either spam or not spam. That data set should contain both spam and non-spam examples.

Terms to Know

Prediction – In supervised Machine Learning we talk about our ML algorithm making predictions. That is, it makes its best guess at an answer based on the rules it has learned from the example data we gave it. Note that the use of the word “predict” in this context does not necessarily imply that the target value is something about the future. Rather, it is simply an unknown value for which we want a quality guess, no matter when it came about or will come about.
In Context: In our house pricing example, we’re getting predictions for the sale price of a house.
Model – The Machine Learning model is the heart of a Machine Learning application. It’s the statistical algorithm that we “teach” to make predictions.
Feature – Features are the data points that we already know. We will ask our Machine Learning model to make a prediction based on these known data points.
In Context: In our house pricing example, relevant features might include the number of bedrooms, and the year of the house’s construction.
Target – In supervised Machine Learning, the value we are asking our ML model to predict is called the target value, or simply, the target.
In Context: In our housing example, the target of our predictions is the sale price of the house.
Labeled Data – Labeled data is a data set made up of complete observation data, i.e., a set of features that includes the correct corresponding target value.
In Context: Labeled data for our housing example might look something like this if we were to format it as a table:

Area (sqft)	Floors	BR	BA	Yard (sqft)	Garage	Pool	School District	Exterior	Roof Material	Year Built	Year Sold	Price
1601	1	3	2	11902	y	n	109	Wood	Asphalt	1974	2008	145111
1840	1	3	2	128840	n	y	108	Brick	Asphalt	1946	2017	220000
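As a rough illustration, the table above can be read into Python and split into features and a target. This sketch assumes the data is stored as tab-separated text with a header row. The names TABLE, TARGET, and split_observation are ours, not part of any Amazon or Temboo API.

```python
import csv
import io

# The two labeled rows from the table above, as tab-separated text.
TABLE = (
    "Area (sqft)\tFloors\tBR\tBA\tYard (sqft)\tGarage\tPool\tSchool District\t"
    "Exterior\tRoof Material\tYear Built\tYear Sold\tPrice\n"
    "1601\t1\t3\t2\t11902\ty\tn\t109\tWood\tAsphalt\t1974\t2008\t145111\n"
    "1840\t1\t3\t2\t128840\tn\ty\t108\tBrick\tAsphalt\t1946\t2017\t220000\n"
)

TARGET = "Price"  # the column we want the model to learn to predict


def split_observation(row):
    """Return (features, target) for one labeled row."""
    features = {name: value for name, value in row.items() if name != TARGET}
    return features, float(row[TARGET])


rows = list(csv.DictReader(io.StringIO(TABLE), delimiter="\t"))
labeled = [split_observation(row) for row in rows]

for features, price in labeled:
    print(len(features), "features ->", price)
```

Every column except Price is a feature, and Price is the target that makes each row “labeled.”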

Training – This is the “Learning” component of Machine Learning. In this part of the process, we help the model understand the relationships between the features we already know and the target value that we will ask it to predict based on that data. This is done by giving it a lot of labeled data.
Classification vs. Regression – In supervised learning, there are two types of predictions that might be made. If your target value is a numerical value, that type of prediction is known as a regression task.

In Context: If we want our house pricing model to tell us the number of thousands of dollars a given house might sell for, then our prediction task is a regression task.

On the other hand, if you would like to predict the value of a target that can only be one of a finite set of values, that’s known as a classification task. There are two types of classification tasks: binary and multiclass classification, where binary classification tasks predict a target that has two possible values, and multiclass classification tasks predict a target that has three or more potential target values.

In Context: If for some reason we only needed to know whether the given house would sell for more or less than $300,000, then that’s a binary classification task. If we had “bucketed” the house prices into sub-ranges such as “less than $200K”, “$200K-$400K”, “$400K-$700K”, and “more than $700K”, and we only wanted predictions about which of those ranges the given house would fall into, that would be a multiclass classification task.
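Here’s a small sketch of how the same sale price could be turned into a binary or a multiclass target when preparing training data. The function names and label strings are hypothetical, and the handling of prices that land exactly on a boundary is our own assumption, since the ranges above don’t specify it.

```python
def binary_label(price, threshold=300_000):
    """Binary classification target: two possible values."""
    return "above" if price > threshold else "at_or_below"


def price_bucket(price):
    """Multiclass classification target: one of four price ranges."""
    if price < 200_000:
        return "less than $200K"
    if price < 400_000:
        return "$200K-$400K"
    if price <= 700_000:
        return "$400K-$700K"
    return "more than $700K"


for price in (145111, 220000, 750000):
    print(price, binary_label(price), price_bucket(price))
```

Leaving the price as a number keeps the task a regression. Replacing it with one of these labels turns it into a classification task.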

Fit – The accuracy or usefulness of a Machine Learning model is known as the model’s fit. Learn more about model fit in the Amazon Machine Learning documentation.

The Supervised Machine Learning Process with Amazon Machine Learning

The quickest way to understand the process is to go hands-on with Amazon’s interactive Amazon Machine Learning tutorial, but we’ll give you a quick rundown and show which Temboo Choreos you can use at each step along the way.

1. Prepare a Training Dataset

The majority of your effort in performing any type of Machine Learning task will be spent on the most important step of all: preparing a dataset for training your machine learning model.

Collecting and cleaning the data requires care, and the time you take to plan ahead will have a strong influence on the worth of the Machine Learning model that results from training with this data. Before you do anything else, you should determine what question you’re asking of your data. Consider whether you really need a numerical value as an answer, or whether your problem is actually a classification task.

The principle of garbage in, garbage out applies to Machine Learning as much as it does to anything that relies on statistical methods or data processing. It’s important to make sure that your training data is made up of observations consisting of data points that are relevant to your target.

It’s possible that not all of the data you have available will be meaningful features for your eventual training dataset. Giving your data a thorough check-up with traditional analytical methods will help you determine which features should stay in your training data and which ones should get thrown out.

Here’s what to look for when analyzing your potential training data. You may find it necessary to perform some preliminary computations on your initial feature values in order to turn them into useful values for ML model training purposes.
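As one example of a preliminary computation, a house’s age at the time of sale may be more informative than its construction year alone. The sketch below derives that value from the Year Built and Year Sold columns of our housing example. The feature name “Age at Sale” and the function are hypothetical, and whether this feature actually helps is something only your own data can tell you.

```python
def add_age_at_sale(row):
    """Return a copy of the row with a derived 'Age at Sale' feature."""
    derived = dict(row)  # copy so the original observation is left untouched
    derived["Age at Sale"] = int(row["Year Sold"]) - int(row["Year Built"])
    return derived


row = {"Year Built": "1974", "Year Sold": "2008", "Price": "145111"}
prepared = add_age_at_sale(row)
print(prepared["Age at Sale"])
```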

2. Train the Machine Learning Model

Once your data is collected, cleaned, and properly formatted, it’s time to upload it to your AWS S3 bucket and create a new data source from the Amazon Machine Learning console.

There are Temboo Choreos for every step of the Training process, but unless you’re regularly creating multiple Machine Learning models, it’s probably more efficient to take care of these steps in the ML Console than to do it programmatically. Here are the Choreos you would need:

CreateDataSourceFromS3 – Creates a DataSource object.
CreateMLModel – Creates a new MLModel using the DataSource and the recipe as information sources.
DescribeMLModels – Returns a list of MLModels that match the search criteria in the request.
GetMLModel – Returns an MLModel that includes detailed metadata, data source information, and the current status of the MLModel.
UpdateMLModel – Updates the MLModelName and the ScoreThreshold of an MLModel.
AddTags – Adds one or more tags to an object, up to a limit of 10.
DescribeTags – Describes one or more of the tags for your Amazon Machine Learning object.
DeleteTags – Deletes the specified tags associated with an ML object.
DeleteMLModel – Assigns the DELETED status to an MLModel, rendering it unusable.
DeleteDataSource – Assigns the DELETED status to a DataSource, rendering it unusable.

Temboo also has Choreos for AWS S3, which can come in handy when uploading new datasets for creating data sources.

3. Evaluate the Accuracy of the Machine Learning Model

When training a machine learning model in supervised machine learning, we set aside a portion of the training data set for testing the model after we have trained it. This way, we can get an approximate idea of whether our Machine Learning model turned out to be accurate after training.
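The idea of holding data back can be sketched in a few lines of Python. Everything here is an illustrative assumption: the data is made up, the 70/30 split is arbitrary, the “model” is only a baseline that always predicts the average training price, and root mean squared error (RMSE) is one common way to score a regression, not necessarily the only metric you’ll see.

```python
import math
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and hold back a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(predictions, actuals):
    """Root mean squared error between predicted and actual values."""
    squared = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up (area, price) observations.
rows = [(1000 + 100 * i, 120000 + 10000 * i) for i in range(10)]
train, test = train_test_split(rows)

# "Train" a baseline on the training rows only, then score it on the held-out rows.
baseline = sum(price for _, price in train) / len(train)
error = rmse([baseline] * len(test), [price for _, price in test])
print(f"Baseline RMSE on held-out houses: {error:,.0f}")
```

The important part is that the held-out rows play no role in training, so the error measured on them is an honest estimate of how the model behaves on houses it hasn’t seen.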

Amazon Machine Learning does this for you automatically and provides simple utilities for evaluating your ML model in the future, should you have new labeled data available to use for the evaluation.

If you would like to regularly and programmatically evaluate your ML Model using a new labeled dataset, you’ll want to use these Choreos:

Temboo’s AWS S3 Choreos – Everything you need for S3 file management
CreateDataSourceFromS3 – Creates a DataSource object.
CreateEvaluation – Creates a new Evaluation of an MLModel.
DescribeEvaluations – Returns a list of Evaluations that match the search criteria in the request.
GetEvaluation – Returns an Evaluation that includes metadata as well as the current status of the Evaluation.
UpdateEvaluation – Updates the EvaluationName of an Evaluation.
AddTags – Adds one or more tags to an object, up to a limit of 10.
DescribeTags – Describes one or more of the tags for your Amazon Machine Learning object.
DeleteTags – Deletes the specified tags associated with an ML object.
DeleteEvaluation – Assigns the DELETED status to an Evaluation, rendering it unusable.
DeleteDataSource – Assigns the DELETED status to a DataSource, rendering it unusable.

4. Use the Model to Make Predictions

Now you’re ready to benefit from all your hard work and generate predictions from your data.

In Amazon Machine Learning, there are two ways to make predictions. You can make individual predictions in realtime, or you can make batch predictions for multiple observations all at once. You should use the method that’s appropriate for the level of urgency for your application in accessing those predictions. The difference between the two is that batch predictions can handle multiple rows of observation data and take more time to produce.

The dataset you’ll send to Amazon Machine Learning to generate predictions should look exactly the same as your training dataset, with one exception: it won’t include the target value.

For batch predictions, you’ll need to upload a properly formatted CSV file containing rows of observations to AWS S3, then you’ll create an Amazon Machine Learning data source from that file.
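Here’s a minimal sketch of producing that CSV from observations held in memory, leaving out the target column. The function name and the sample rows are hypothetical. Whether the file should include a header row depends on how your data source is configured, so treat the header written here as an assumption.

```python
import csv
import io


def to_batch_csv(rows, target="Price"):
    """Serialize observations to CSV text, leaving out the target column."""
    fieldnames = [name for name in rows[0] if name != target]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # silently skip the target if a row still carries it
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"Area (sqft)": 1601, "BR": 3, "Price": 145111},
    {"Area (sqft)": 1840, "BR": 3, "Price": 220000},
]
print(to_batch_csv(rows))
```

The resulting text can then be saved to a file and uploaded to S3 like any other dataset.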

For realtime predictions, you can either manually enter observation data in the Amazon Machine Learning console, or you can use the Amazon Machine Learning API. When using the API, you’ll need to build a JSON string containing your set of observations. It may take a bit of experimentation to get your JSON string properly formatted to match the schema of your training dataset.
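As a starting point for that experimentation, here’s a sketch that turns one observation into a JSON string. It assumes the record is a flat mapping of feature names to string values and that the names match the training schema exactly. The helper name and the sample observation are ours, so check the API reference for the exact request shape before relying on it.

```python
import json


def to_prediction_record(observation, target="Price"):
    """Map feature names to string values, leaving out the target."""
    return {
        name: str(value)
        for name, value in observation.items()
        if name != target
    }


observation = {"Area (sqft)": 1601, "BR": 3, "Garage": "y"}
record_json = json.dumps(to_prediction_record(observation))
print(record_json)
```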

Temboo’s AWS S3 Choreos – Everything you need for S3 file management
CreateDataSourceFromS3 – Creates a DataSource object.
CreateRealtimeEndpoint – Creates a real-time endpoint for the MLModel. The endpoint contains the URI of the MLModel, which is the location to send real-time prediction requests for the specified MLModel.
Predict – Generates a prediction for the observation using the specified ML Model.
CreateBatchPrediction – Generates predictions for a group of observations.
UpdateBatchPrediction – Updates the BatchPredictionName of a BatchPrediction.
AddTags – Adds one or more tags to an object, up to a limit of 10.
DescribeTags – Describes one or more of the tags for your Amazon Machine Learning object.
DeleteTags – Deletes the specified tags associated with an ML object.
DeleteBatchPrediction – Assigns the DELETED status to a BatchPrediction, rendering it unusable.
DeleteDataSource – Assigns the DELETED status to a DataSource, rendering it unusable.
DeleteRealtimeEndpoint – Deletes a real time endpoint of an MLModel.

Machine Learning for the IoT

Using supervised Machine Learning in conjunction with the Internet of Things presents some exciting possibilities and interesting challenges. IoT has tremendous potential to generate vast amounts of data, and the opportunity is ripe for Machine Learning to play an impactful role.

The first challenge of Machine Learning for IoT is the relatively limited amount of computational resources found on embedded devices. This is the very reason that the cloud and edge device model has become so prevalent: we’re offloading intensive processing to a more powerful, more central computer. It’s the perfect context for a Machine Learning cloud service.

Perhaps the greatest hurdle for any ML application, IoT or otherwise, is collecting a meaningful and representative training dataset. Understand that this step may be very time consuming, depending on the nature of the data you’re collecting.

To build your initial training dataset, consider the following: depending on what your model will be predicting, you may need to collect data points from multiple devices, perhaps in disparate physical locations. You may also consider gathering some of your data from third-party sources, such as local weather data from a weather service API, or generating Natural Language metrics using a service like the Google Cloud Natural Language API.

Depending on the number of sources that pool to create your dataset, you may need to gather all of your data points in a central location to then properly format and send to AWS. It’s up to you whether you do that locally on your own server or a gateway device, or you choose to do it through a cloud services database API, or something as simple as Google Sheets.
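Wherever you choose to do it, the central step often amounts to lining up readings from different sources into one row per observation. The sketch below groups readings by timestamp. The source names, measurement names, and tuple layout are all hypothetical, and real data would also need a policy for timestamps that don’t line up exactly.

```python
from collections import defaultdict


def merge_readings(readings):
    """Group (timestamp, source, name, value) readings into one row per timestamp."""
    rows = defaultdict(dict)
    for timestamp, source, name, value in readings:
        rows[timestamp][f"{source}.{name}"] = value
    return [{"timestamp": ts, **fields} for ts, fields in sorted(rows.items())]


readings = [
    ("2018-06-01T12:00", "press-1", "temperature", 71.5),
    ("2018-06-01T12:00", "weather-api", "humidity", 40),
    ("2018-06-01T13:00", "press-1", "temperature", 73.0),
]
merged = merge_readings(readings)
for row in merged:
    print(row)
```

Note that the second row has no humidity value. Gaps like this are exactly the kind of thing to catch and handle while cleaning your data, before it’s formatted and sent to AWS.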

Applications of Machine Learning in IoT include predictive maintenance and optimizing equipment performance and system efficiency. For example, to understand factory equipment, you might begin gathering data about energy expenditure, vibration, temperature, product scrap rates, other product metrics, and machinery malfunctions. That’s just the beginning. Anywhere you place devices to monitor physical conditions could have potential as a Machine Learning data source.