Karthik Tadinada

Dr Karthik Tadinada is a senior data scientist with Featurespace’s Financial Services and Insurance clients, working in analysis and data modelling.

Last week, a catchily-titled machine learning article (‘Bot makes $2.4m reading Twitter’) caught our eye here at Featurespace. The article described how a bot traded options based on tweets. As pioneers of using machine learning for fraud and risk management in Financial Services, we were naturally curious about how to create such a bot.

So, how could you do it?

The three components of a machine learning system are:

Lots of high quality data – A large amount of reliable data is a priority for focusing on key elements to make accurate predictions. For example, knowing that I don’t like Terminator does not tell you what movies I do like. However, if you also knew I like Star Wars and Back to the Future, but don’t like Total Recall, then you might conclude I like Sci-Fi, but I’m not a Schwarzenegger fan.

Good signals extracted from the data – Next, the data needs processing to extract further insights. For example, to determine if a credit card transaction was fraudulent, you could calculate the cardholder’s average transaction amount, and identify if the suspect transaction was abnormally large. This is akin to Featurespace’s core strength – monitoring entities in real-time and identifying anomalies in behavioural signals which indicate unusual activity.

An algorithm to combine the signals and generate a prediction – The last step is to combine the information and decide how to weight each piece to generate a prediction. The methods which have had the most success use a large number of examples labelled by humans, which teach a computer to generate accurate predictions.
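The signal in the credit card example above (comparing a suspect transaction with the cardholder’s average) can be sketched in a few lines of Python. This is a minimal illustration, not a description of any production system: the function name, the sample amounts and the threshold of three standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def is_abnormally_large(history, amount, threshold=3.0):
    """Flag `amount` if it sits more than `threshold` standard
    deviations above the cardholder's average transaction."""
    if len(history) < 2:
        return False  # not enough history to judge
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Every past transaction was identical: anything larger is unusual.
        return amount > average
    return (amount - average) / spread > threshold

# Hypothetical past transaction amounts for one cardholder.
history = [12.50, 30.00, 22.75, 18.20, 25.00]
print(is_abnormally_large(history, 24.00))
print(is_abnormally_large(history, 950.00))
```

A real system would track many such signals per cardholder and update them as each transaction arrives; a single average is only the simplest case.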

Now we know what components we need, how do we build our money-making bot?

Step 1: Getting trading data

The original article implied that the bot read the Twitter firehose, picking out tweets that mentioned stocks. The easier option would be to use services that produce news in computer-readable format (e.g. the Dow Jones Elementized news feed). Assuming that we don’t go down this easy (and expensive!) path, we’d have to process the text from Twitter.

A chief complication would be judging the reliability of the information – after all, you can’t believe everything you read on the internet! Each Twitter account could be tracked, perhaps trusting recently created accounts less than well-established ones.

Additionally, a list of companies for which we can buy options is needed, so they can be identified in tweets. A machine learning approach is essential – human observers could never monitor these vast data volumes in real time.
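As a rough sketch of that identification step, the snippet below matches tweets against a watchlist, both by ‘cashtag’ (a ticker prefixed with a dollar sign, as commonly used on Twitter) and by company name. The watchlist, the function name and the sample tweet are hypothetical; a real list would come from an exchange or broker.

```python
import re

# Hypothetical watchlist mapping ticker symbols to company names.
OPTIONABLE = {"INTC": "Intel", "ALTR": "Altera", "AAPL": "Apple"}

# A cashtag is a dollar sign followed by a short run of letters, e.g. $INTC.
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")

def find_companies(tweet):
    """Return the sorted watchlist tickers mentioned in a tweet,
    either as a cashtag or by company name."""
    found = {tag.upper() for tag in CASHTAG.findall(tweet)
             if tag.upper() in OPTIONABLE}
    lowered = tweet.lower()
    for ticker, name in OPTIONABLE.items():
        # Whole-word match on the company name, ignoring case.
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", lowered):
            found.add(ticker)
    return sorted(found)

print(find_companies("Intel in talks to buy $ALTR, sources say"))
```

Plain name matching is deliberately naive: a tweet about apple pie would be tagged as Apple, which is exactly the kind of ambiguity the later steps have to deal with.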

Step 2: Enhancing the data – generating trading signals

The most difficult part of extracting “meaning” from Twitter data is the almost infinite number of forms in which the same piece of news can be written. Take, for example, these tweets about Intel dropping the acquisition of Altera:

[Image: example tweets]

One of the major, famous advances in language processing by computers has been word2vec, a technique introduced in a paper from Google. It extracts “meaning” from words in a deeper way than previous methods. The intuition behind it is that words used to express the same concept tend to occur alongside the same surrounding words. It is therefore possible to learn what concept each word represents by looking at the surrounding words. Of course, all the machine does is group words of similar “concepts” together, and it is up to humans to give meaning to these clusters. The most powerful thing about this method is that it allows ‘arithmetic’ on words in a way that seems sensible to humans. For example, ‘king’ - ‘man’ + ‘woman’ ≈ ‘queen’.
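To make the word ‘arithmetic’ concrete, here is a toy sketch. The two-dimensional vectors are hand-made for illustration (one axis loosely standing for “royalty”, the other for “maleness”); real word2vec vectors are learned from large amounts of text and have hundreds of dimensions. The helper names are assumptions for this example.

```python
import math

# Hand-made toy vectors: (royalty, maleness). Not learned from data.
vectors = {
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target, exclude):
    """Return the word whose vector is most similar to `target`,
    skipping the words that were used to build it."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# king - man + woman, computed component by component.
target = tuple(k - m + w for k, m, w in
               zip(vectors["king"], vectors["man"], vectors["woman"]))
print(nearest(target, exclude={"king", "man", "woman"}))
```

With these toy vectors the nearest remaining word is ‘queen’. With learned vectors the result is only approximate, which is why the answer is taken as the closest word rather than an exact match.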

So, we could feed our trading bot financial press articles and identify tweets containing words expressing the “acquisition” concept.

Step 3: Generating a trade

Finally, our bot should make large trades only when extremely confident in the information processed – after all, risk management is the most important element for a trader.

A prediction method is needed that returns the probability that each tweet expresses the fact that one company is buying another. This prediction method could take into account the track record of the source, strength of the ‘acquisition’ concept, and market indicators (e.g. volume of trade in the stocks). A technique called logistic regression is a good bet for producing predictions of probability nearly instantly.
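The scoring side of logistic regression is simple enough to sketch directly: a weighted sum of the signals is squashed into a probability between 0 and 1. The feature names, weights and bias below are invented for illustration; in practice they would be fitted to a large set of labelled examples rather than picked by hand.

```python
import math

# Illustrative weights only; a real model would learn these from labelled tweets.
WEIGHTS = {
    "source_track_record": 2.0,   # how reliable the account has been (0 to 1)
    "acquisition_strength": 3.0,  # strength of the 'acquisition' concept (0 to 1)
    "volume_spike": 1.0,          # unusual trading volume in the stock (0 to 1)
}
BIAS = -4.0  # most tweets are not about acquisitions, so start sceptical

def acquisition_probability(features):
    """Combine the signals into a probability with the logistic function."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

weak = {"source_track_record": 0.1, "acquisition_strength": 0.2, "volume_spike": 0.0}
strong = {"source_track_record": 1.0, "acquisition_strength": 1.0, "volume_spike": 1.0}
print(round(acquisition_probability(weak), 3))
print(round(acquisition_probability(strong), 3))
```

Scoring is one multiplication and addition per signal, which is why predictions arrive nearly instantly once the weights are known.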

Even the size of the bet can be automated. There is a mathematical result called the Kelly Criterion that identifies the correct size of the bet given the probability of the event happening and the odds offered.
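For a simple bet that either wins or loses the whole stake, the Kelly Criterion says to stake the fraction f = p - (1 - p) / b of the bankroll, where p is the probability of winning and b is the net odds (profit per unit staked). The sketch below implements that formula; the function name is an assumption, and real options payoffs are not simple win-or-lose bets, so this only illustrates the idea.

```python
def kelly_fraction(p, b):
    """Fraction of the bankroll to stake on a win-or-lose bet.

    p: probability of winning (between 0 and 1)
    b: net odds, i.e. profit per unit staked if the bet wins
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if b <= 0:
        raise ValueError("b must be positive")
    fraction = p - (1.0 - p) / b
    # A negative result means the odds are unfavourable: do not bet.
    return max(fraction, 0.0)

# A 60% chance of winning at even odds suggests staking about a fifth.
print(round(kelly_fraction(0.6, 1.0), 3))
```

Note how the stake shrinks to zero as the edge disappears: at even odds, a 50% chance of winning gives no reason to bet at all.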

So now we have our bot! A prime example of how machine learning can perform tasks thought to be the sole realm of humans. It’s close to our heart; one of Featurespace’s first projects was working with a computer gaming company to spot bots within their systems. Following this, in 2008 we were approached by Betfair to spot fraudulent players in their Gaming systems. Our Adaptive Behavioural Analytics approach also enabled Betfair to understand legitimate customer behaviour that should not be blocked, letting more genuine transactions through.

Since then, our core ARIC™ software engine has been developed into a series of products for managing fraud, risk and compliance. From fighting fraud for mobile payment platform Zapp, to identifying rogue trading for KPMG, our products are being used by innovative companies to protect and serve their customers.

But don’t worry, we’re using our powers for detecting bots, rather than writing them.

Dr Tadinada studied with Professor Bill Fitzgerald (Co-Founder of Featurespace) at Cambridge University Engineering Department’s Signal Processing Group, and was previously Junior Consultant at the Boston Consulting Group. In his spare time he enjoys contributing to UK mathematics, setting and marking questions for the British Mathematical Olympiads.