Neural songwriter – Generating Lyrics for any given Image using Multi-modal Semi-Supervised Machine Learning

There is a strong correlation between the images we imagine and the song being played. For a romantic love song, for example, the images would normally be of happy, shining, beautiful, loving people. We decided to train models on a set of random images along with songs, and then try to generate a song stanza for any given random image using machine learning.

Below are some of the results obtained by this project. On the left is a random image passed to the model, and on the right are the lyrics the model generated by making “contextual” sense after “perceiving” the image. The results look interesting, as it is hard to tell whether these lyrical stanzas were created by a machine or by a human.



Multi-modal learning is machine learning in which the training set consists of more than one mode of data (text, images, audio etc.), with the modes trained together to derive results. For example, images and text are trained together so that, by looking at an image, the model can produce the “context” of that image. Internally this uses deep learning methods such as RNN (Recurrent Neural Network) / LSTM (Long Short-Term Memory). It gives a “perspective” prediction of the given object, based on the “style” of the training data. For example, given an image of a “wall”, we can generate text in the style of what Trump, Hillary or someone else might say about it. We found this a fascinating new development, one which brings machines closer to art and human perception, and decided to extend it by training on song lyrics to see if we could generate some meaningful lyrical quotes.

The approach first trains a model on a semi-supervised dataset like Flickr, using the captions so that the model can generate a caption for an image. A second model is then trained on a certain style element, such as lyrics, using the concept of skip-thought vectors, so that it can create lyrics in that trained style. Lastly, the output of the first model (which generates the captions) is fed as input to the second model (which generates style-based lyrics) to produce “contextual lyrics” based on an image. This harnesses the power of RNNs, since the output of an RNN depends not just on the current input but on all the inputs before it; it operates on sequences of vectors, sequencing both the input and the output.

It relies heavily on the neural architectures developed by open source contributors in this field, especially Karpathy (then at Stanford).

1. Introduction

Machine learning has long been used to recognize patterns. Recently, however, the idea that machines can be used to “create” patterns has caught everyone's imagination. Having machines create art by mimicking known artistic styles, or provide a “human-like perspective” as output for any given input, has become a new frontier of machine learning.

While RNNs have been known for a while, only very recently have they been trained successfully as models that take a sequence of inputs and produce a sequence of outputs. (This is in sharp contrast to typical Neural Networks or Convolutional Neural Networks, which accept a fixed-sized vector as input, e.g. an image, and produce a fixed-sized vector as output, e.g. probabilities of different classes.) RNNs combine the input vector with their state vector using a fixed (but learned) function to produce a new state vector. Even though the actual computation is done using LSTM, the conceptual method described below remains the same. The following explains at a high level how the underlying model works.

1.1     Char-RNN Model [1]

Karpathy has best explained this model as below.

An RNN accepts an input vector x and gives you an output vector y. However, the contents of this output vector are influenced not only by the input you just fed in, but also by the entire history of inputs you’ve fed in in the past. The RNN also keeps a hidden vector h as internal state, which it updates every time a step is called. The forward pass of this network is:

h_t = tanh(W_hh h_{t-1} + W_xh x_t)
y_t = W_hy h_t

The tanh implements a non-linearity that squashes the activations to the range [-1, 1]. The state update uses two weight matrices, W_hh applied to the previous hidden state and W_xh applied to the current input, and combines the results into the new state vector.
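The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the project: the class name, the layer sizes and the 0.01 initialization scale are assumptions made for the example.

```python
import numpy as np


class VanillaRNN:
    """Single-layer RNN cell (illustrative sizes and initialization)."""

    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; the 0.01 scale is an assumption of this sketch.
        self.W_xh = rng.standard_normal((hidden_size, input_size)) * 0.01
        self.W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
        self.W_hy = rng.standard_normal((output_size, hidden_size)) * 0.01
        self.h = np.zeros(hidden_size)  # internal state, updated on every step

    def step(self, x):
        # The new state mixes the previous state with the current input.
        self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
        # The output is read off the updated state.
        return self.W_hy @ self.h


rnn = VanillaRNN(input_size=4, hidden_size=3, output_size=4)
x = np.array([1.0, 0.0, 0.0, 0.0])  # 1-of-k encoding of one character
y1 = rnn.step(x)
y2 = rnn.step(x)  # same input, but the state has changed in between
```

Feeding the same input twice gives two different outputs, which is the sense in which the output depends on the history of inputs and not only on the current one.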

The matrices of the RNN are initialized with random numbers, and the bulk of the work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference for the kinds of outputs y you’d like to see in response to your input sequences x. RNNs are then stacked in the classical deep learning manner, using multiple layers so that the output of one layer is given as the input to the next.


An example RNN with 4-dimensional input and output layers, and a hidden layer of 3 units (neurons). This diagram shows the activations in the forward pass when the RNN is fed the characters “hell” as input. The output layer contains confidences the RNN assigns for the next character (vocabulary is “h,e,l,o”); We want the green numbers to be high and red numbers to be low.

The models also use the concept of so-called “skip-thought” vectors. To see how training works in the example above: in the first time step, when the RNN saw the character “h”, it assigned a confidence of 1.0 to the next letter being “h”, 2.2 to the letter “e”, -3.0 to “l”, and 4.1 to “o”. Since in our training data (the string “hello”) the next correct character is “e”, we would like to increase its confidence (green) and decrease the confidence of all other letters (red). Similarly, we have a desired target character at every one of the 4 time steps that we’d like the network to assign a greater confidence to. Since the RNN consists entirely of differentiable operations, we can run the backpropagation algorithm to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers). We can then perform a parameter update, which nudges every weight a tiny amount in this gradient direction. If we were to feed the same inputs to the RNN after the parameter update, we would find that the scores of the correct characters (e.g. “e” in the first time step) would be slightly higher (e.g. 2.3 instead of 2.2), and the scores of incorrect characters would be slightly lower.
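As a rough illustration of the first time step described above, the sketch below turns the four confidences into probabilities with a softmax, computes the cross-entropy loss for the target “e”, and takes one gradient step. For simplicity it nudges the scores directly, which is an assumption of this sketch (in the real network the gradient flows back into the weights), and the learning rate of 0.1 is an arbitrary choice.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
scores = np.array([1.0, 2.2, -3.0, 4.1])  # confidences from the text, first time step
target = vocab.index("e")                 # correct next character in "hello"


def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


probs = softmax(scores)
loss = -np.log(probs[target])  # cross-entropy loss for this time step

# Gradient of the loss with respect to the scores: probs minus the one-hot target.
grad = probs.copy()
grad[target] -= 1.0

learning_rate = 0.1  # illustrative value
new_scores = scores - learning_rate * grad
```

The gradient is negative only at the target index, so the step raises the score of “e” and lowers the other three, matching the green/red description above.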

This process is then repeated over and over until the network converges and its predictions are eventually consistent with the training data, in that the correct characters are always predicted next. Note that the first time the character “l” is input, the target is “l”, but the second time the target is “o”. The RNN therefore cannot rely on the input alone and must use its recurrent connection to keep track of the context to achieve this task.

At test time, we feed a character into the RNN and get a distribution over what characters are likely to come next. We sample from this distribution, and feed the sample right back in to get the next letter. Repeating this generates text in the style of the training data. The RNN is trained with mini-batch Stochastic Gradient Descent, using a Softmax classifier on every output vector simultaneously.

1.2     Image to Caption RNN Model[1]

This model uses the same concept as the one above, but is trained on a semi-supervised dataset of images. It is worth explaining what such a dataset consists of: it contains images and captions written by volunteers. This is the only input in the whole process that involves some form of human interaction. Currently very few such datasets are available, such as Flickr, created by the University of Illinois, and COCO, released by Microsoft for a competition. Intuitively speaking, the model is trained on images and their sentence descriptions to learn the inter-modal correspondences between language and visual data. It uses a CNN and a Multimodal RNN to generate descriptions for the images. Again, LSTM is used to implement the RNN.


The model assumes an input dataset of images and their sentence descriptions. The key intuition is that sentences written by people make frequent references to some particular, but unknown, location in the image. For example, the words “Tabby cat is leaning” refer to the cat, the words “wooden table” refer to the table, etc. The goal is to infer these latent correspondences, and eventually to learn to generate such snippets from image regions. The paper describes neural networks that map words and image regions into a common, multimodal embedding, and introduces an objective that learns the embedding representations so that semantically similar concepts across the two modalities occupy nearby regions of the space. The CNN first transforms pixels into the activations of the fully connected layer immediately before the classifier. Every image is thus represented as a set of h-dimensional vectors. To establish the inter-modal relationships, the words in the sentence are represented in the same h-dimensional embedding space that the image regions occupy. A Bidirectional Recurrent Neural Network (BRNN) computes the word representations (shown above as LSTM). The BRNN takes a sequence of N words (encoded in a 1-of-k representation) and transforms each one into an h-dimensional vector. From there it follows the typical implementation of an RNN. The activation function is the rectified linear unit (ReLU), which computes f(x) = max(0, x). Training the RNN is exactly the same as described for the previous model. Finally, SGD with mini-batches is used to optimize the model. [1]
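To make the idea of a shared embedding space concrete, here is a small sketch of how an image and a sentence could be scored against each other once both are expressed as h-dimensional vectors. The scoring rule used here (each word is matched to its most similar region by dot product, and the matches are summed) is an assumption of this sketch rather than something spelled out in this post, and the function names are made up for illustration.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)


def image_sentence_score(region_vecs, word_vecs):
    """Score an image against a sentence in the shared embedding space.

    region_vecs: array of shape (R, h), one vector per image region.
    word_vecs:   array of shape (N, h), one vector per word.
    Each word is matched to its best region (assumed scoring rule).
    """
    sims = word_vecs @ region_vecs.T      # (N, R) dot-product similarities
    return float(sims.max(axis=1).sum())  # best region per word, summed


# Toy example with h = 2: two regions and two two-word "sentences".
regions = np.array([[1.0, 0.0], [0.0, 1.0]])
matching_words = np.array([[1.0, 0.0], [0.0, 1.0]])
unrelated_words = np.array([[-1.0, 0.0], [0.0, -1.0]])

score_match = image_sentence_score(regions, matching_words)
score_unrelated = image_sentence_score(regions, unrelated_words)
```

A sentence whose word vectors lie near the image's region vectors gets a higher score than an unrelated one, which is the property the training objective is meant to encourage.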

1.3     Combine the two models

We then tried to combine the two models above to see if we could generate styled descriptions based on images. Intuitively, the combination takes the captions generated for images and creates output in the style of another set of training data. [10]


So in the above example, we have an image-to-text model that is trained on, for example, the MS COCO semi-supervised dataset. There is another model which is trained on, for example, book data. The caption generated by the first model for a given image can instead be rendered in the “style” of the second training dataset, by priming or seeding the second model with it.

2. Method & Experiments

2.1     Data Collection- Images with captions

We collected two types of data to train the models: images with captions (the Flickr dataset) and text styles (song lyrics).

There are very few semi-supervised datasets of images available. We used the Flickr dataset of images with captions created by the University of Illinois at Urbana-Champaign. We requested access to this dataset and they provided it. This became our image training dataset. [3]

2.2       Data Collection- Song Lyrics

Below are the steps carried out to collect the song lyrics. The code is also included at the end.

  1. Manually download the subset of the Million Song Dataset. It contains 10000 songs and all the metadata about them. The dataset was created by Columbia University for academic purposes. [5]
  2. Extract the metadata from this dataset, specifically the artist and title, and add them to a dataframe.
  3. Extract the lyrics for each captured artist and title. To do this we used LyricWiki. [6]
  4. Match the artist and title extracted from the Million Song Dataset against the LyricWiki site to extract the lyrics, and add them to the dataframe.
  5. Continue this process for all songs, eventually ending up with the lyrics of all songs in the subset.
  6. Clean the data a little. The first step is to get rid of songs for which no lyrics were found. These can be instrumental songs or songs missing from LyricWiki.
  7. Get rid of non-English songs. To do this we wrote a function using the NLTK package. It uses an English vocabulary to compute the percentage of non-English words in a song. English songs naturally contain some words that are not in the English dictionary, or are even from other languages, so instead of filtering strictly we removed only those songs in which more than half of the words are non-English. We end up with a dataset of English song lyrics.
  8. Convert this to CSV format so that it can be used for training.
  9. The code for the above process is included along with the GitHub project.
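The cleaning steps above (dropping songs without lyrics, then dropping songs that are mostly non-English) can be sketched as follows. This is not the project's actual script: the function names, the record layout (a dict with a "lyrics" key) and the tokenizing regex are assumptions. The English vocabulary is passed in as a plain set so the example stays self-contained; with NLTK it could be built from `nltk.corpus.words.words()` after downloading the `words` corpus.

```python
import re


def non_english_fraction(lyrics, english_vocab):
    """Fraction of words in the lyrics that are not in the English vocabulary."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    if not words:
        return 1.0  # nothing recognizable: treat as non-English
    unknown = sum(1 for w in words if w not in english_vocab)
    return unknown / len(words)


def keep_english(songs, english_vocab, max_non_english=0.5):
    """Drop songs with no lyrics or with more than half non-English words."""
    return [
        song for song in songs
        if song["lyrics"]
        and non_english_fraction(song["lyrics"], english_vocab) <= max_non_english
    ]


# Tiny stand-in vocabulary and records, for illustration only.
vocab = {"i", "love", "you", "baby"}
songs = [
    {"title": "A", "lyrics": "I love you baby"},
    {"title": "B", "lyrics": "te quiero mucho baby"},
    {"title": "C", "lyrics": ""},                 # e.g. an instrumental
    {"title": "D", "lyrics": "I love you oooh"},  # one unknown word is tolerated
]
english_songs = keep_english(songs, vocab)
```

The 0.5 threshold mirrors the "more than half" rule described in step 7; a stricter cut-off would also discard songs with slang or borrowed words.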

2.3     Environment Setup

  1. Training the model needed substantial computational power, so we procured a 32-core Ubuntu Linux server in the cloud.
  2. Docker was then installed on the machine, with an image that comes with Python 2.7 and the other packages needed to meet the solution requirements. The package has a few dependencies that were not listed, so we preferred the Docker option.
  3. Despite this, the model took a very long time to train: almost 24 hours to complete a single training cycle, which consisted of 50 epochs and 50000 iterations. It should be noted that our dataset was still moderately small, just a few thousand songs’ lyrics.



3. Training Methods

I. Image to Text Captions Model

  • Cloned the GitHub project for NeuralTalk and trained the model on the Flickr dataset. This took quite a while to complete, as it ran for more than 40 hours on the cloud server.
  • Some sample results are below.


II. Contextual Char RNN model

  • Clone the Char-RNN project from GitHub.
  • The model is first trained for testing purposes on poems, and later on the songs dataset.
  • The char-rnn algorithm is implemented, briefly, as follows (refer to the file in the folder).
  • Define the parameters (learning rate etc.) and use the Adagrad gradient method. The hidden state variable h is used twice: once going vertically up to the prediction (y), and once going horizontally to the next hidden state h at the next time step. During backpropagation these two “branches” of computation both contribute gradients to h, and these gradients have to add up.
  • Give the RNN a sequence of inputs and outputs (seq_length long), and use them to adjust the internal state.
  • Initialize the model parameters with random values.
  • Create the x matrix, which holds 1-of-k representations of x, along with the predicted y.
  • Keep the normalized probabilities of each output through time, and the hidden state vectors through time.
  • Create the forward pass for the RNN model
    • Find un-normalized log probabilities for next chars
    • Find probabilities for next chars
    • Use softmax (for cross-entropy loss)
  • Backward pass: compute gradients going backwards
    • Find updates for y
    • Backprop into h and through the tanh nonlinearity
    • Find updates for h
    • Save dh_next for the subsequent iteration
    • Update RNN parameters according to Adagrad
  • Open a given text file
    • Make some dictionaries for encoding and decoding from 1-of-k
    • Set insize and outsize to len(chars); choose hidsize, seq_length and learning_rate
    • Iterate over batches of inputs to the RNN and the targets it should be outputting
    • Calculate losses over the iterations
  • Get results
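The steps listed above can be condensed into a short NumPy sketch of a character-level RNN trained with Adagrad. It is written for illustration and is not the project's file: the tiny stand-in corpus, the hidden size of 32, the sequence length of 8, the gradient clipping range and the 300 training steps are all assumptions.

```python
import numpy as np

data = "hello world " * 20  # stand-in corpus; the project trained on lyrics
chars = sorted(set(data))
char_to_ix = {c: i for i, c in enumerate(chars)}  # encoding dictionary
vocab_size, hidden_size, seq_length, learning_rate = len(chars), 32, 8, 0.1

rng = np.random.default_rng(0)
Wxh = rng.standard_normal((hidden_size, vocab_size)) * 0.01
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
Why = rng.standard_normal((vocab_size, hidden_size)) * 0.01
bh, by = np.zeros((hidden_size, 1)), np.zeros((vocab_size, 1))
params = [Wxh, Whh, Why, bh, by]


def loss_and_grads(inputs, targets, hprev):
    xs, hs, ps = {}, {-1: hprev}, {}
    loss = 0.0
    for t, (i, j) in enumerate(zip(inputs, targets)):  # forward pass
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][i] = 1.0                                 # 1-of-k input
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1] + bh)
        y = Why @ hs[t] + by                           # un-normalized log probs
        ps[t] = np.exp(y - y.max())
        ps[t] /= ps[t].sum()                           # softmax probabilities
        loss += -np.log(ps[t][j, 0])                   # cross-entropy loss
    grads = [np.zeros_like(p) for p in params]
    dWxh, dWhh, dWhy, dbh, dby = grads
    dh_next = np.zeros_like(hprev)
    for t in reversed(range(len(inputs))):             # backward pass
        dy = ps[t].copy()
        dy[targets[t]] -= 1.0                          # gradient at the output
        dWhy += dy @ hs[t].T
        dby += dy
        dh = Why.T @ dy + dh_next                      # the two branches add up
        draw = (1.0 - hs[t] ** 2) * dh                 # back through tanh
        dbh += draw
        dWxh += draw @ xs[t].T
        dWhh += draw @ hs[t - 1].T
        dh_next = Whh.T @ draw                         # saved for step t - 1
    for g in grads:
        np.clip(g, -5, 5, out=g)                       # guard against blow-ups
    return loss, grads, hs[len(inputs) - 1]


memory = [np.zeros_like(p) for p in params]            # Adagrad accumulators
hprev, pos, losses = np.zeros((hidden_size, 1)), 0, []
for step in range(300):
    if pos + seq_length + 1 >= len(data):              # wrap around the corpus
        hprev, pos = np.zeros((hidden_size, 1)), 0
    inputs = [char_to_ix[c] for c in data[pos:pos + seq_length]]
    targets = [char_to_ix[c] for c in data[pos + 1:pos + seq_length + 1]]
    loss, grads, hprev = loss_and_grads(inputs, targets, hprev)
    losses.append(loss)
    for p, g, m in zip(params, grads, memory):         # Adagrad update
        m += g * g
        p -= learning_rate * g / np.sqrt(m + 1e-8)
    pos += seq_length
```

With near-zero initial weights the first loss sits close to `seq_length * log(vocab_size)`, the loss of a uniform guess, and it should fall as training proceeds.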

To adjust to the structure of songs, this needs to be tuned a little. First we look at all the poems and assign each distinct character a number. The number of distinct characters is called the vocab size, denoted V. Given a poem of length N, we can encode the poem as an NxV matrix. We encode the labels as an N-dimensional vector whose k-th component is the label of the (k+1)-st character.
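A small sketch of this encoding is shown below. One detail is not spelled out in the description: what the label of the last character should be, since it has no successor. This sketch assumes a hypothetical end-of-text marker (`END`) is added to the vocabulary and used as the final label; the function names are also illustrative.

```python
import numpy as np

END = "\0"  # hypothetical end-of-text marker (an assumption of this sketch)


def build_vocab(texts):
    """Assign each distinct character (plus END) a number."""
    chars = sorted(set("".join(texts)) | {END})
    return {c: i for i, c in enumerate(chars)}


def encode(text, char_to_ix):
    """Return an N x V one-hot matrix and an N-vector of next-character labels."""
    V = len(char_to_ix)
    idx = np.array([char_to_ix[c] for c in text], dtype=int)
    X = np.zeros((len(text), V))
    X[np.arange(len(text)), idx] = 1.0             # one row per character
    labels = np.append(idx[1:], char_to_ix[END])   # label k is character k + 1
    return X, labels


vocab = build_vocab(["abba"])
X, labels = encode("abba", vocab)
```

For the toy text "abba" the vocabulary has V = 3 entries (the marker, "a" and "b"), so X is 4x3 and each label points at the character that follows.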

To train our models efficiently we want to use “mini-batch” RMSProp updates; however, this is a bit of a problem since our data don’t all have the same length. Given a batch, we find the longest stanza (say of length M) and then pad the matrices of the other poems in the batch with rows of zeros until they are all MxV. The batch can now be coded as a BxMxV 3-tensor, B being the batch size. To remember where these rows of zeros are located, we keep a BxM mask matrix of ones and zeros representing real and phantom characters, respectively. In the forward pass on this batch, the model computes what it thinks the most likely next character is and uses this prediction to compute the loss. However, the model doesn’t know the difference between the real characters and the phantom characters, and just merrily computes the loss for all characters. That is where the mask matrix comes in: the phantom losses get multiplied by the zeros in the matrix, so they don’t get accumulated. At generation time, the model makes predictions one character at a time. Given a character, the model computes a score for each possible next character. The model uses a Softmax loss function, so these scores can be roughly thought of as giving a probability distribution. The next character is then selected by sampling from this distribution. Sampling continues until the model decides the song has ended.
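The padding and masking described above might look roughly like this in NumPy. The function names and the small epsilon clamp inside the logarithm are assumptions of this sketch; the predicted probabilities are taken as given rather than produced by a model.

```python
import numpy as np


def pad_batch(encoded):
    """Pad a list of (X, labels) pairs, X of shape (N_i, V), to a common length M.

    Returns a (B, M, V) tensor, (B, M) labels and a (B, M) mask of ones
    (real characters) and zeros (phantom characters).
    """
    B = len(encoded)
    M = max(X.shape[0] for X, _ in encoded)
    V = encoded[0][0].shape[1]
    batch = np.zeros((B, M, V))
    labels = np.zeros((B, M), dtype=int)
    mask = np.zeros((B, M))
    for b, (X, y) in enumerate(encoded):
        n = X.shape[0]
        batch[b, :n] = X
        labels[b, :n] = y
        mask[b, :n] = 1.0
    return batch, labels, mask


def masked_loss(probs, labels, mask):
    """Mean cross-entropy over real characters only; probs has shape (B, M, V)."""
    B, M = labels.shape
    picked = probs[np.arange(B)[:, None], np.arange(M)[None, :], labels]
    per_char = -np.log(np.maximum(picked, 1e-12))  # clamp avoids log(0)
    return float((per_char * mask).sum() / mask.sum())


# Two toy sequences of lengths 3 and 1 over a vocabulary of size 2.
seqs = [
    (np.eye(2)[[0, 1, 0]], np.array([1, 0, 1])),
    (np.eye(2)[[1]], np.array([0])),
]
batch, labels, mask = pad_batch(seqs)
uniform = np.full((2, 3, 2), 0.5)  # stand-in for model predictions
loss = masked_loss(uniform, labels, mask)
```

Because the phantom positions are multiplied by zero, changing the predictions at those positions leaves the loss unchanged.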

An important parameter for getting different results here is -temperature, which takes a number in the range (0, 1]. The temperature divides the predicted log probabilities before the Softmax, so a lower temperature causes the model to make more likely, but also more boring and conservative, predictions. Higher temperatures cause the model to take more chances and increase the diversity of results, but at the cost of more mistakes.
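The effect of the temperature can be seen in a few lines. This is a generic sketch of temperature sampling, not the sampling code of the project; the function name and the example distribution are made up for illustration.

```python
import numpy as np


def sample_next(log_probs, temperature, rng):
    """Sample a character index after dividing log probabilities by the temperature."""
    if not 0 < temperature <= 1:
        raise ValueError("temperature must be in the range (0, 1]")
    scaled = np.asarray(log_probs, dtype=float) / temperature
    scaled -= scaled.max()          # shift for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()            # softmax over the scaled values
    return int(rng.choice(len(probs), p=probs)), probs


rng = np.random.default_rng(0)
log_probs = np.log([0.6, 0.3, 0.1])  # illustrative model output

_, probs_t1 = sample_next(log_probs, 1.0, rng)     # distribution unchanged
idx, probs_t03 = sample_next(log_probs, 0.3, rng)  # sharpened distribution
```

At temperature 1 the distribution is left as it is; at 0.3 most of the probability mass moves to the most likely character, which is why low temperatures give safer but more repetitive text.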

For the lyrics dataset we found that at higher temperatures the results were just a stream of machine characters/symbols. In fact, above 0.65 the results lose their structure and get jumbled up.

Below are some of the better sample lyrics generated by the model, at temperature t=0.4. (The results were copied from the terminal window to TextPad for better viewing.)




The results were definitely interesting. The model was able to pick up the concept of a song split into stanzas, and added spaces between the stanzas (yes, those spaces are generated by the model and not added manually, as are some of the punctuation marks). It also generated results of average song length, with repetition present. The model, however, didn’t pick up the concept of rhyme (maybe because rhyme depends on sound rather than text), and overall the results didn’t look that impressive, so we decided to tune the model.

The next tuning step was to split the songs into stanzas of one, two or at most three lines. This dataset is called “refined_lyrics”. The model was now run for 50000 iterations, and the batch size and seq_length were also changed by trying multiple combinations. Below are the results received.


These results look much better, in that they make much more sense, at least at the level of each sentence. The model also keeps the stanzas at an almost uniform length.

4. Results

The char-rnn model can be seeded with an input sequence, so that the output sequence that follows is generated based on the probabilities given the seed. The next step is to take the caption generated for an image by the first model and provide it as a seed to the second model. For example, if we seed the model with the word “kiss”, we get the results below.



This is then combined with the image captions from the first model: the seed provided to the second model comes from the caption produced by the first. The results obtained look like the ones below; the example above is the last one. To do this, run the script.

Notes for the examples below: in the first example the keyword from the caption was “door”, in the second image “sun”, in the third “song” and in the last image “kiss”.
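How a keyword is picked from a caption is not described here, so the following is only one possible sketch: take the first caption word that is not a common stopword and that the lyrics model has seen, and fall back to the whole caption otherwise. The function name, the stopword list and the example caption are all hypothetical.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "is", "are", "and", "with", "to"}


def pick_seed(caption, lyric_vocab):
    """Pick a seed word for the lyrics model from an image caption.

    Returns the first caption word that is not a stopword and appears in
    lyric_vocab (the set of words seen in the lyrics data); if none
    qualifies, the whole caption is returned as the seed.
    """
    for word in caption.lower().split():
        word = word.strip(".,!?")
        if word and word not in STOPWORDS and word in lyric_vocab:
            return word
    return caption


lyric_vocab = {"kiss", "sun", "door", "song"}  # stand-in vocabulary
seed = pick_seed("a man and a woman kiss in the park", lyric_vocab)
fallback = pick_seed("a dog", lyric_vocab)
```

The resulting seed would then be passed to the char-rnn sampler as its priming text.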




5. Conclusion

Creating art using machine learning algorithms opens up new possibilities for what can be done in this field. What we have done in this project is build a modest wrapper around recently developed algorithms and extend it to a different area. The results were surprisingly impressive, and the model was able to understand and generate decent-quality stanzas. There was of course plenty of junk output as well (not shown above), so it may not pass the “Turing Test” just yet, but it is surely a very interesting method.

The use of RNNs did throw up some challenges, such as extremely long training times even on a small dataset; however, the results were surprisingly good. The predictions gave good results for lyrics at temperatures of around 0.3 to 0.5, above which the output became truly random streams of non-English characters. This is in contrast to other text style models, which work better at higher temperatures. This was one of the key findings of this work, and it could be used to change the algorithm so that it converges sooner and runs faster. This project serves as a contribution to the fast-growing field of creating abstraction using algorithms, so that models can become even better.

The original objective of the project was to create an RNN model that extends the concept of contextual learning to “emotional moods”. This can be done by using training datasets of songs of different moods/genres (e.g. happy, upbeat, sad). It would also take into account the beats and notes of the songs (for example, using guitar compositions of the songs). It would further involve changing the base char-rnn algorithm to include audio vectors, a different hidden vector h and a different sequencing method. The model itself would need a different function to work. The eventual outcome of such a model would be generating a song (lyrics plus guitar tunes) for a given context. Unfortunately this had to be scoped down for a variety of reasons, the primary one being the very long training time required even for a very small dataset. Another big hurdle was obtaining a songs dataset, since songs are protected by copyright law. (Using such a method can throw up interesting non-technical challenges as well: if such a song is published, who holds the copyright on it? Can the holders of the training dataset make a claim on it?) However, we intend to continue this work in future and complete the original objective.

Future Work

  1. Google Brain recently (in just the last few weeks) released its own version of an image-to-text caption model. It has also been added as a new package to TensorFlow. We became aware of it after we had put effort into these open source models, but the next step should be to redo this work using TensorFlow and compare the speed and accuracy of the results.
  2. The algorithm itself remains largely trial and error. We tried different numbers of layers, with no great difference in output. One challenge was coming up with a contextual accuracy parameter that measures the accuracy of the created lyrics on a “does it make sense to humans” scale. Proposing some technical criterion for measuring the accuracy of such models would be important future work in this regard.
  3. The idea itself can be extended to audio and video. In movies there is a strong correlation between the action on screen (fight, romance, violence, comedy) and the background music played. Eventually it should be possible to extend multi-modal learning to the audio-visual domain and predict/generate a background score for short videos (even recreate a Charlie Chaplin movie with its background score generated by machine learning)!

Code Instructions

PS- I will shortly update this with link to github project for all of below code.

  • Flickr Image Dataset: This is not included in the submission due to its large size. It can be provided if required.
  • Lyrics songs dataset: This is not included in the submission due to its large size. It can be provided if required.
  • Datacollection_lyrics: This is used to extract the lyrics dataset. The code has comments and is quite straightforward to run based on them. Please note that this script may take a while to run.
  • To train the image-to-caption model, clone the GitHub project for NeuralTalk.
  • Train the model on the Flickr dataset. Run the code to do the multimodel-rnn training. Please note that training will take a long time to complete. The model is not included due to its size but can be obtained from GitHub.
  • Run the code and get predictions in the file “results.json”. (This is included in the submission.)
  • The results can be viewed using html page visualize_result_struct, However since we were working on cloud server
  • To train the char-rnn model, clone the GitHub project.
  • Include the extracted data in the \data folder. The trained models will be checkpointed and added under \cv. In the submission we included two trained models, one on the full songs dataset and one on the refined lyrics dataset. Please note that training will take a long time to complete.
  • To get sample results for the full song lyrics dataset, run the command below. The option -n 1 limits the results to 1 song and -t adjusts the temperature.

python -m finalcheckpoint_baseline_1.926.p -n 1 -t 0.3

  • To get sample results for the refined songs/stanzas lyrics dataset, run the command below. The option -n 3 limits the results to 3 lines/stanzas and -t adjusts the temperature.

python -m finalcheckpoint_baseline_1.492.p -n 3 -t 0.3

To prime the output, a seed can be supplied using -s.

  • To get the final results, use the JSON results and the char-rnn seeds with the “” script.

Note: We used a Docker image to run the project. If not, the steps below can be used.

  • Install Python 2.7 on an Ubuntu server with the dependencies mentioned in the requirement.txt page. This doesn’t include some hidden dependencies of various packages, which we encountered while running the project. The Docker image is the preferred option.


  1. Original Paper by Author on Image to Text –
  2. Multimodal RNN –
  3. Semi Supervised Images Flickr 8K dataset –
  4. Poetry Dataset –
  6. Million songs Dataset-
  7. Lyric Lookup Wiki –
  8. Github Project-
  9. Github Project-
