August 25, 2015 · textmining ·

the data of long distance lovers

It is a truism to say that a relationship is hard to maintain. It might be even more difficult when the Atlantic Ocean separates the two partners. But thanks to the internet and those big data companies that feed on our personal information, we can now have a real relationship even with 5,500 km between us. Real in the sense of small talk, common projects, and fights... Only the means differ from a conventional relationship, as the acoustic vibrations produced by the mouth are replaced by 0s and 1s traveling through transoceanic cables.

To illustrate this point, here is some analysis of our daily discussions through the phone app Viber. Viber lets its users export a nice, well-formatted csv file containing the date, the sender, and the content of each message. So here are a few graphs that summarize it.

number of messages

The log holds slightly more than a year of data and exactly 20,846 messages; she is responsible for 51.2% of the total. The split is fairly even, even though he has sent 520 fewer messages than she has. Let's state that it is due to the more talkative characteristic of girls in general...
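For readers who want to redo this count on their own log, here is a minimal sketch. It assumes rows shaped as (date, sender, content), as described above; the sample rows and the function name `share_by_sender` are made up for illustration, since the real csv is private.

```python
from collections import Counter

# Hypothetical sample rows (date, sender, content); not the real log.
messages = [
    ("2015-03-01 18:02:10", "Her", "coucou"),
    ("2015-03-01 18:02:40", "Him", "hey :)"),
    ("2015-03-01 18:03:05", "Her", "how was your day?"),
    ("2015-03-01 18:05:30", "Him", "long, but good"),
    ("2015-03-01 18:05:50", "Her", "hehe"),
]

def share_by_sender(rows):
    """Return {sender: (message count, percentage of all messages)}."""
    counts = Counter(sender for _, sender, _ in rows)
    total = sum(counts.values())
    return {sender: (n, 100 * n / total) for sender, n in counts.items()}

for sender, (n, pct) in share_by_sender(messages).items():
    print(f"{sender}: {n} messages ({pct:.1f}%)")
```

As a sanity check on the real figures: a gap of 520 messages out of 20,846 means 10,683 messages from her and 10,163 from him, which is 51.2% and 48.8%.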

It is interesting to continue the comparison between Her and Him with a focus on the time taken to respond to a message.

time to answer a message

I like the previous graph, which shows that more than 60% of the messages from Her and Him are answered within 2 minutes. However, we see that he has a tendency to let her wait longer, sometimes even more than 8 hours! The explanation for this trend is that most of these long delays come from messages sent during his night.

Answerer    Mean time to answer    Median time to answer
Her         38 min                 55 sec
Him         81 min                 54 sec

The table illustrates the huge difference between the mean and the median time to answer: a few very long waits pull the mean up, while the median barely moves. From this we can say that 50% of his answers are made in less than 54 seconds, compared to 55 seconds for her answers.
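To make the mean versus median gap concrete, here is a small sketch. It assumes that a "time to answer" is the delay between a message and the first following message from the other person; that definition, the sample timestamps, and the name `answer_delays` are my assumptions for illustration, not necessarily what the original script does.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical (timestamp, sender) pairs, in chronological order.
log = [
    ("2015-03-01 18:00:00", "Her"),
    ("2015-03-01 18:00:30", "Him"),  # he answers after 30 s
    ("2015-03-01 18:01:00", "Her"),  # she answers after 30 s
    ("2015-03-01 18:01:20", "Her"),  # same sender: not an answer
    ("2015-03-01 23:30:00", "Her"),
    ("2015-03-02 07:30:00", "Him"),  # he answers after 8 h (his night)
    ("2015-03-02 07:31:00", "Her"),  # she answers after 60 s
    ("2015-03-02 07:31:50", "Him"),  # he answers after 50 s
]

def answer_delays(rows):
    """Seconds between a message and the other person's reply, per answerer."""
    delays = {}
    prev_time, prev_sender = None, None
    for stamp, sender in rows:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Only a change of sender counts as an answer.
        if prev_sender is not None and sender != prev_sender:
            delays.setdefault(sender, []).append((t - prev_time).total_seconds())
        prev_time, prev_sender = t, sender
    return delays

for who, secs in answer_delays(log).items():
    print(who, f"mean={mean(secs):.0f}s", f"median={median(secs):.0f}s")
```

In this toy log a single overnight wait is enough to push his mean into hours while his median stays under a minute, which is the same pattern as in the table.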

The comparison stops here, as the Viber log is above all a shared and loving conversation between two people! So let's talk about the global statistics of this year of messages exchanged apart.

histogram number of character

No surprise here in the histogram of the number of characters per message. The majority of the messages are tiny, less than 15 characters. We also see that at least 10% of the messages contain 4 characters or fewer. Most of the time small messages are quickly sent and are more than enough! Those are mostly smileys, "yep", "hehe", "oui", "yes" and some lonely punctuation marks.
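The shares quoted above are simple to compute from message lengths. A minimal sketch, with made-up sample messages and a hypothetical helper `share_at_most`:

```python
# Hypothetical sample of message contents; not the real log.
texts = [
    "yep", "hehe", ":)", "oui", "I will call you after work tonight",
    "ok", "bisous", "see you at 8 then", "?", "miss you",
]

def share_at_most(messages, max_chars):
    """Fraction of messages whose length is at most max_chars characters."""
    return sum(len(m) <= max_chars for m in messages) / len(messages)

print(f"4 characters or fewer: {share_at_most(texts, 4):.0%}")
print(f"less than 15 characters: {share_at_most(texts, 14):.0%}")
```

Note that `len` counts characters, so a single smiley typed as ":)" counts as 2.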

Speaking of small words, it is fun to focus on our language of laughter and love. The following two pie charts reproduce, on our log, a Facebook study of how people express laughter or love in instant messaging. For laughter, our results differ slightly from Facebook's, as the proportion of "haha" is lower and the proportion of "lol" is higher than expected. Facebook would qualify us as hehe-ers.
In terms of e-love, we mostly send "bisous" or "kisses", but some "cuddles" and "smacks" are also part of our e-love vocabulary.
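Counting laughs comes down to a few regular expressions. The sketch below is one possible way to do it; the patterns, the three families, and the name `laugh_counts` are my assumptions, and they are deliberately loose so that "hahaha" or "looool" fall into the right family.

```python
import re
from collections import Counter

# Assumed patterns: repeated "ha", repeated "he", and "lol" with a stretched "o".
LAUGH_PATTERNS = {
    "haha": re.compile(r"\b(?:ha){2,}\b", re.IGNORECASE),
    "hehe": re.compile(r"\b(?:he){2,}\b", re.IGNORECASE),
    "lol": re.compile(r"\blo+l\b", re.IGNORECASE),
}

def laugh_counts(texts):
    """Count occurrences of each laugh family across all messages."""
    counts = Counter()
    for text in texts:
        for label, pattern in LAUGH_PATTERNS.items():
            counts[label] += len(pattern.findall(text))
    return counts

# Hypothetical sample messages.
sample = ["hehe ok", "lol", "hahaha no way", "Hehehe", "looool", "yes"]
print(laugh_counts(sample))
```

The same approach works for the e-love words by swapping in patterns for "bisous", "kisses", "cuddles" and "smacks".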


The following figure groups the messages by hour of the day and reports them as a histogram. Most of the messages are sent at the end of the day (6:00pm to midnight). Note that the times in the log are Paris times, which explains why almost no messages are sent between 1:00am and 7:00am, as those hours are commonly used to sleep. Similarly, no messages are exchanged between 8:00am and 1:00pm, as it is sleep time in Boston (6 hours of difference).

histo hours
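To see both sleep gaps, it helps to count the messages per hour on the Paris clock and again on the Boston clock. This is a minimal sketch that assumes a fixed 6-hour offset, as stated above; it ignores the few weeks each year when daylight saving changes are out of sync between the two cities. The sample timestamps and the name `messages_per_hour` are made up.

```python
from collections import Counter
from datetime import datetime, timedelta

# Fixed offset taken from the text; an approximation around DST changes.
PARIS_TO_BOSTON = timedelta(hours=-6)

# Hypothetical timestamps, expressed in Paris time like the log.
stamps = [
    "2015-03-01 00:30:00",
    "2015-03-01 13:15:00",
    "2015-03-01 18:05:00",
    "2015-03-01 18:40:00",
    "2015-03-01 23:10:00",
]

def messages_per_hour(timestamps, shift=timedelta(0)):
    """Count messages per hour of day, optionally shifted to another clock."""
    hours = Counter()
    for stamp in timestamps:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S") + shift
        hours[t.hour] += 1
    return hours

paris = messages_per_hour(stamps)
boston = messages_per_hour(stamps, PARIS_TO_BOSTON)
for hour in range(24):
    print(f"{hour:02d}h  Paris: {paris[hour]}  Boston: {boston[hour]}")
```

With this shift, a message sent at 1:15pm in Paris lands at 7:15am in Boston, which is where the morning gap in the histogram ends.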

Here is a calendar with the number of messages shared each day. The redder the day is, the more texts were sent. Usually, fewer than 50 messages are sent each day. On this graph, we see blank days in August 2014, Christmas 2014, May 2015, and August 2015; those are obviously times spent together in real life!

calendar plot log
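The numbers behind such a calendar are just a count per date, and the blank days are the dates with no entry. A short sketch under that assumption, with made-up timestamps and hypothetical helper names:

```python
from collections import Counter
from datetime import date, datetime, timedelta

# Hypothetical timestamps; not the real log.
stamps = [
    "2015-08-01 09:00:00",
    "2015-08-01 21:30:00",
    "2015-08-02 19:00:00",
    "2015-08-05 08:15:00",
]

def messages_per_day(timestamps):
    """Count messages for each calendar day."""
    return Counter(
        datetime.strptime(s, "%Y-%m-%d %H:%M:%S").date() for s in timestamps
    )

def silent_days(per_day):
    """List the days without any message between the first and last message."""
    first, last = min(per_day), max(per_day)
    all_days = (first + timedelta(days=i) for i in range((last - first).days + 1))
    return [d for d in all_days if d not in per_day]

per_day = messages_per_day(stamps)
print(silent_days(per_day))
```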

Finally, let's have a look at the most used words in the conversations. In the word cloud, the size of a word is a proxy for its frequency, and the color groups words with similar frequencies. So purple is for the most used words, then in decreasing order we have green, blue, and red.
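The frequencies that drive the word sizes can be obtained with a simple word count. The sketch below is only an illustration: the tokenizer, the `min_length` filter that drops very short words, and the sample messages are my assumptions, not the settings used for the actual cloud.

```python
import re
from collections import Counter

def word_frequencies(texts, min_length=3):
    """Count lowercase words made of letters only, skipping very short ones."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including accented ones.
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return counts

# Hypothetical sample messages.
sample = ["Bisous mon amour", "bisous !", "Good night, bisous", "good morning"]
print(word_frequencies(sample).most_common(2))
```

Splitting the messages by sender before counting gives the input for one cloud per person.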

As a bonus, here are the two personal word clouds:


The code that generates all the graphs is available on my github. However, the csv file is not public for obvious reasons.

