Systems like Eliza were good at making a sophisticated first impression but were easily found out after a few conversational turns. Such systems were built by collating as much world knowledge as possible and formalising it into concepts and the relations between them. Those concepts and relations were then built into grammars and lexicons that could analyse and generate natural language via intermediate logical representations. The world knowledge might, for example, contain facts such as “chocolate is edible” and “a rock is not edible”.
Learning from data
Today’s conversational AI systems are different in that they target open-domain conversation – there is no limit to the number of topics, questions or instructions a human can ask. This is mainly achieved by avoiding any kind of intermediate representation or explicit knowledge engineering altogether. In other words, the success of current conversational AI rests on the premise that it knows and understands nothing of the world.
The basic deep learning model underlying most current work in natural language processing is called a recurrent neural network: a model that predicts an output sequence of words from an input sequence of words, using a probability function estimated from data. Given the user input “How are you?”, the model can determine that a statistically frequent response is “I am fine.”
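As a rough illustration of what mapping an input sequence of words to an output sequence looks like in code, here is a minimal sketch of a recurrent encoder–decoder written with PyTorch. The vocabulary size, dimensions and word ids are arbitrary stand-ins for this example, not the design of any particular system mentioned in this article.

```python
# A toy recurrent sequence-to-sequence model: an encoder GRU reads the input
# words into a hidden state, and a decoder GRU scores possible next words of
# the response at each step. All sizes and ids below are illustrative.
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HIDDEN_DIM = 1000, 64, 128  # assumed toy sizes

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.encoder = nn.GRU(EMB_DIM, HIDDEN_DIM, batch_first=True)
        self.decoder = nn.GRU(EMB_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)  # scores over next words

    def forward(self, src_ids, dec_in_ids):
        # Encode the input sequence into a single hidden state.
        _, hidden = self.encoder(self.embed(src_ids))
        # Decode the response: at each step the decoder sees the words so far
        # and produces scores (logits) for the next word.
        dec_out, _ = self.decoder(self.embed(dec_in_ids), hidden)
        return self.out(dec_out)

# Toy usage: the input “How are you ?” as word ids, and the decoder fed a
# start symbol plus the response so far (“I am fine”), all ids made up here.
model = Seq2Seq()
src = torch.tensor([[4, 5, 6, 7]])
dec_in = torch.tensor([[0, 8, 9, 10]])
logits = model(src, dec_in)  # shape: (1, 4, VOCAB_SIZE)
```

The encoder turns the question into a numerical summary; the decoder then assigns a score to every word in the vocabulary at each step, and the most probable words form the reply.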
The power of these models lies partly in their simplicity – because there are no intermediate representations, more data typically leads to better models and better outputs. Learning for an AI is in some ways similar to how we learn: the system digests a very large training data set and is then tested against data it has not seen before (a test set). Based on how well it performs on that test set, the predictive model is adjusted to get better results, and the test is repeated.
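That train-then-test cycle can be sketched as a short loop. The snippet below continues the toy model above (it reuses the Seq2Seq class and VOCAB_SIZE), and the data are random word ids standing in for real conversations, so it only illustrates the shape of the process, not real results.

```python
# Sketch of the train/evaluate cycle: adjust the model on training data, then
# measure how well it predicts held-out data it has not seen. Reuses Seq2Seq
# and VOCAB_SIZE from the previous sketch; the data here are random stand-ins.
import torch
import torch.nn as nn

model = Seq2Seq()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in training and held-out sets of (input, response) word-id pairs.
train_src = torch.randint(0, VOCAB_SIZE, (32, 6))
train_tgt = torch.randint(0, VOCAB_SIZE, (32, 6))
test_src = torch.randint(0, VOCAB_SIZE, (8, 6))
test_tgt = torch.randint(0, VOCAB_SIZE, (8, 6))

for epoch in range(5):
    # Adjust the model on the training data ...
    model.train()
    optimizer.zero_grad()
    logits = model(train_src, train_tgt[:, :-1])
    loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), train_tgt[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()

    # ... then measure how well it predicts responses it has never seen.
    model.eval()
    with torch.no_grad():
        test_logits = model(test_src, test_tgt[:, :-1])
        test_loss = loss_fn(test_logits.reshape(-1, VOCAB_SIZE),
                            test_tgt[:, 1:].reshape(-1))
    print(f"epoch {epoch}: training loss {loss.item():.3f}, "
          f"held-out loss {test_loss.item():.3f}")
```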
But how do you determine how good an output is? You can look at the grammar of utterances, how “human-like” they sound, or the coherence of a contribution within a sequence of conversational turns. Quality can also be judged subjectively, as an assessment of how closely the outputs meet expectations. MIT’s DeepDrumpf is a good example – an AI system trained on data from Donald Trump’s Twitter account which sounds uncannily like him, commenting on topics such as healthcare, women or immigration.
However, problems start when models receive “wrong” inputs. Microsoft’s Tay was an attempt to build a conversational AI that would gradually “improve” and become more human-like by having conversations on Twitter. Tay infamously turned from a philanthropist into a political bully with an incoherent and extremist worldview within 24 hours of deployment, and was soon taken offline.
As machines learn from us, they also take on our flaws – our ideologies, moods and political views. But unlike us, they don’t learn to control or evaluate them – they only map an input sequence to an output sequence, without any filter or moral compass.