This exploratory report analyses different data mining methods for predicting profitable customers of a dating site.

‘. . . Knowledge Discovery is the most desirable end-product of computing. Finding new phenomena or enhancing our knowledge about them has a greater long-range value than optimizing production processes or inventories, and is second only to tasks that preserve our world and our environment. It is not surprising that it is also one of the most difficult computing challenges to do well . . .’ (Wiederhold, 1996).

The main objective of knowledge discovery in data mining is to find patterns in data. Knowledge about current customers can be used to predict profitable customers based on their personal information. A second key aspect is matching individual customers based on their personal information.

The dataset analysed is derived from the customer database of Australia’s largest dating site, which has over 1.9 million members. The dataset contains static and dynamic activity. Static activity comprises all personal, demographic and interest information entered by the customer at registration. Dynamic activity comprises the emails sent, the channels used for communication and the kisses sent.

- How do users interact?
- Who is likely to pay money based on static behaviour?
- Who is likely to pay money based on dynamic behaviour?
- What makes a person purchase a stamp?

The basis for the report is the user behaviour analysis. After a general analysis, the focus shifts to determining which users are likely to pay for the service, using both dynamic and static data. In the final step, the combined findings are used to propose an implementation strategy for the future development of the website.

2 Related Work

Online social networks and identity representation are active research areas with input from computer science, statistics, sociology and psychology. Studies on psychological aspects of social identity representation examine the social implications of displaying public identities (Donath, 2004). The aim of this paper is to analyse the interaction between the users of an online dating website and how it influences their payment behaviour.

Toma (2008) addresses the self-presentation issue by observing the characteristics of users to assess the truthfulness of online dating profiles. Hu and Zeng (2007) also use a framework to predict users’ identity based on their self-presentation history. While their proposed algorithm achieved high prediction accuracy, their method cannot determine whether the predicted traits are real or fabricated.

There are some recent academic studies on online social interaction using popular networks. Carverlee and Webb (2008) studied the characteristics of MySpace profiles based on facets of this social network. Their paper is similar to our work; however, its focus was to identify elements of sociability and to explain how language use differs between genders. Work on other social networks such as Facebook also focuses on identity presentation and information sharing in student networks. Acquisti and Gross (2006) and Tufekci (2008) examined the disclosure behaviour of MySpace and Facebook users in relation to privacy issues. Other authors proposed a methodology for clustering and identifying similarities in user behaviour on YouTube data. Lerman and Jones (2006) used a small data sample from Flickr and found that the social network is used to locate new content on the site. Nowwell (2003) investigated co-authorship networks in physics to test how well different graph proximity metrics can predict future collaborations.

The paper at hand focuses on analysing the monetary aspect of an online dating website based on the user profiles. Similar to the work of Carverlee and Webb, the basis is the analysis of user behaviour. This is extended by a more prediction-orientated analysis, which has not previously appeared in the scientific literature in the context of dating websites. The company will benefit from the resulting prediction rules: the website can focus on its target customers and try to attract potential customers.

3 Conceptualization of Processes of Discovery

Figure 2 depicts the process conceptualization of the report. Its structure is based on Figure 1. For the User Behaviour Analysis, a first overview is obtained with a Regression Analysis, which shows the influence of the different behaviour attributes. Regression is a powerful tool for analysing data with an interval target variable (“Stamps”). It requires the data to be cleaned beforehand: missing values have to be imputed and skewed variables have to be transformed to achieve a good result. Afterwards, a Cluster Analysis helps to categorize the data and analyse it in more detail. The resulting behaviour is explained in written form to build the basis for the next steps of the analysis.
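The cleaning steps that precede the regression can be illustrated with a minimal sketch. It assumes median imputation and a log(1 + x) transform, which are common choices but are not confirmed as the ones used in this report. The toy values and the names `emails_sent`, `stamps`, `impute_median`, `log_transform` and `fit_line` are hypothetical.

```python
import math
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical toy data: emails sent per user (None = missing) and stamps bought.
emails_sent = [0, 2, None, 5, 40, 3, None, 120]
stamps = [0, 1, 0, 2, 6, 1, 1, 9]

x = log_transform(impute_median(emails_sent))
slope, intercept = fit_line(x, stamps)
print(f"stamps ~ {intercept:.2f} + {slope:.2f} * log(1 + emails_sent)")
```

A positive slope in such a fit would indicate that more active users tend to buy more stamps. A real analysis would use several attributes at once rather than a single predictor.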

In the second step, the focus shifts to the payment aspect. The aim is to analyse which customers will pay for the service based on their dynamic and static behaviour. The Neural Network is used to classify the data and to indicate which variables are relevant. The Decision Tree is used as a second classifier so that the classification results can be compared.
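To make the decision tree part of this step concrete, the following sketch finds the single best split of a tree (a one-level "stump") using Gini impurity. It is a simplified stand-in, not the tool or configuration used in the report, and it does not cover the neural network. The toy data and the names `rows`, `paid`, `gini` and `best_split` are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best single split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

# Hypothetical toy data: (age, emails_sent) per user and whether the user paid.
# Age stands for a static attribute, emails_sent for a dynamic one.
rows = [(25, 0), (41, 1), (33, 2), (52, 8), (29, 10), (47, 15)]
paid = [0, 0, 0, 1, 1, 1]

feature, threshold, impurity = best_split(rows, paid)
print(f"split on feature {feature} at {threshold} (weighted Gini {impurity})")
```

In this toy example the dynamic attribute separates payers from non-payers perfectly, which is the kind of readable rule a decision tree contributes to the comparison.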

In the third step, the resulting rules are combined into a rule set that predicts the likelihood that a customer with specific attributes will spend money. In the last step, the rule set is converted into an implementation proposal.
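One simple way to combine individual rules into a rule set is sketched below. It scores a customer by the fraction of rules that apply, which is only one of several possible combination schemes and is an assumption rather than the method of this report. The rules, thresholds and field names (`emails_sent`, `kisses_sent`, `has_photo`) are hypothetical placeholders.

```python
# Each rule is a (description, predicate) pair; all thresholds are hypothetical.
RULES = [
    ("sent more than 5 emails", lambda u: u["emails_sent"] > 5),
    ("sent at least 3 kisses", lambda u: u["kisses_sent"] >= 3),
    ("profile has a photo", lambda u: u["has_photo"]),
]

def payment_score(user, rules=RULES):
    """Return the fraction of rules the user satisfies and the rules that fired."""
    fired = [name for name, test in rules if test(user)]
    return len(fired) / len(rules), fired

# Hypothetical customer record mixing dynamic and static attributes.
user = {"emails_sent": 8, "kisses_sent": 1, "has_photo": True}
score, fired = payment_score(user)
print(f"score {score:.2f}, rules fired: {fired}")
```

Keeping the description next to each predicate means the fired rules can be reported directly, which helps when the rule set is later turned into an implementation proposal.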