Quantile Normalized Neighbor Combinatorial Machine Learning Based Recommendation in Digital Marketing

Abstract: The scientific advancement of recent years has set industries on the move, and marketing has reached the point where a shift to digital channels is essential. Although this may appear to be a plunge for marketers, automated applications and systems built on artificial intelligence in fact reduce the complexity of conventional targeting and customization procedures. In several applications, the platforms used for online promotion carry algorithms for recognizing the best combinations, whereas in other cases the businesses or institutions engaged in digital marketing design and execute in-house personalized arrangements. As a case study, a method called Quantile Normalized Neighbor Combinatorial Learning-based Recommendation (QNNCL-R) is applied to generate new leads that will ultimately become customers (i.e., promoting higher education to students for admission branding in our scenario) using a Twitter dataset. The data obtained from the Twitter dataset (i.e., higher education) are fed to the recommendation system. Then, the relevant set of features and event labels (i.e., tweets) is selected by Quantile Normalized Chi-square Feature selection, and a Neighbor Combinatorial Learning-based Recommendation algorithm with the best performance is selected for the higher education recommendation process. The QNNCL-R method is compared with other algorithms, and the results indicate that it performs better than the other methods.


Introduction
The internet is a powerful mechanism that can be used to attract customers, strengthen credibility and broaden the brand of a product or service. Social Media (SM) offers a platform where users communicate and cooperate virtually. Users' opinions are shaped by the repeated advertisements they come across on numerous microblogging and social media platforms. Moreover, business analysts use SM for business scrutiny, corporate awareness and product cognizance. Advancements in the business world have made social media one of the essential instruments of marketing strategy, specifically for brand health and brand development.
To obtain uninterrupted income and a growing base of active customers, key business players must understand the actions and purchase inclinations of buyers. To forecast buyers' purchase choices, data on purchase objectives and inclinations have to be acquired. In (Arasu et al., 2020), Machine Learning integrated Social Media Marketing (ML-SMM) was proposed, involving text mining, machine learning integrated with social media marketing and, finally, analysis of ML-SMM using the WEKA tool.
Starting from the concepts of social media marketing, machine learning was applied and combined with the WEKA machine learning tool to predict online consumer behavior, thereby ensuring efficient marketing. This provided finer reporting capabilities and laid the foundations for improved precision and recall. Despite the improvement observed in both precision and recall, the processing time involved in predicting consumer behavior for lead generation was not addressed. To address this issue, in this work a Quantile Normalized Chi-square Feature selection algorithm is designed using a preprocessing library that first tokenizes the tweets, after which computationally efficient relevant tweets are selected using the Quantile Normalized function, thereby minimizing the processing time of the overall lead generation process.
A machine learning method using social media content marketing for the brand health of a company was proposed in (Pappu, 2019), catering to the scope of paint perceptions in the social media convention. Moreover, the importance of paints in the social media convention involving the company's social media posts was discussed, together with a measure of relevancy and the discussions made on social media pages.
Finally, the influence of periodic content posting was identified and discussed in detail using machine learning techniques, thereby contributing to prediction and forecasting accuracy. Although improvements were observed in prediction and forecasting accuracy, the false positive rate involved in generating leads for social media content marketing was not addressed. To solve this issue, a Neighbor Combinatorial Machine Learning Lead Generation algorithm is designed that reduces the false positive rate using a combinatorial function.

Article Contributions
Aiming to improve the traditional recommendation model to obtain good recommendation accuracy while minimizing the overall processing time and false positive rate, the QNNCL-R method is applied to the distance learning Twitter dataset. We propose a Quantile Normalized Chi-square Feature selection model that uses a quantile normalization function to remove irrelevant tweets, followed by Neighbor Combinatorial Learning-based Recommendation to provide accurate recommendations from neighbor users. Experimental results reveal that the proposed QNNCL-R method provides better recommendation results with lower processing time and false positive rate than conventional recommendation methods.

Article Organization
The rest of the article is organized as follows. Relevant literature is reviewed in Section 2. The QNNCL-R method is derived in Section 3. Experimental settings are provided in Section 4. A detailed discussion with comparison against state-of-the-art methods is provided in Section 5. Section 6 concludes this work.

Related Works
A complete literature review of the considerable empirical contributions made so far in this research area was presented in (Saura, 2020). However, given the notable challenges posed by negative electronic word-of-mouth and intrusive online brand presence, the collective insight of several leading experts on issues pertaining to digital and social media marketing was investigated in (Dwivedi et al., 2020).
Digital business platforms (DBPs) such as eBay, Google and Uber Technologies have experienced immense growth; the role of marketing in helping DBPs succeed was discussed in (Rangaswamy et al., 2020). Moreover, the use of social media marketing by small and medium enterprises was studied in (Dahnil et al., 2014).
In (Shen et al., 2020), the value of demand learning via Social Media Exposure (SME) for luxury brands was studied using a two-period model, contributing to cost minimization and accuracy maximization. Customers' attitudes towards various brands and their purchase intentions were analyzed in detail in (Abzari et al., 2014). A detailed discussion of the future of digital marketing was presented in (Appel et al., 2020).
An elaborate review of the literature on the prominence and impact of users on social media between September 2010 and September 2019 was presented in (Al-Yazidi et al., 2020). In (Hayat et al., 2019), Deep Learning architectures for Social Media Analytics (SMA) were discussed through a taxonomy-oriented summary. A review of the literature on the application of Twitter across the educational domain was presented in (Malik et al., 2019). A sentiment analysis framework using different deep learning models and activation functions, concentrating on accuracy, was discussed in (Chen et al., 2019). Another method using a variance-based structural equation model, concentrating on social media in higher education institutes, was proposed in (Ansari & Khan, 2020).
A systematic literature study on deep learning algorithms was presented in (Yang et al., 2020). The application of different machine learning algorithms to sentiment analysis was presented in (Sharma & Jain, 2020). Machine learning and Artificial Intelligence in the area of marketing were analyzed in detail in (Ma & Sun, 2020) and (Shah et al., 2020). In (Miklosik et al., 2019), the selection and application of machine learning tools for analyzing their impact on digital marketing was investigated. Motivated by the above methods, the QNNCL-R method is proposed.

Quantile Normalized Neighbor Combinatorial Learning-based Recommendation (QNNCL-R) Method
The problem of predicting leads among the student population and thereby recommending a specific higher educational institute is formulated as a Multi Criteria Recommendation problem. At first, the whole dataset is divided into eight sub-datasets. Each sub-dataset represents the data of a specific higher educational institute and consists of different features; some of these features are trivial and others are informative. The computationally efficient, dimensionality-reduced tweets are obtained by the Quantile Normalized Chi-square Feature selection model. We propose an adaptive recommendation system for predicting a specific higher educational institute based on the leads generated from the selected relevant tweets. The selected tweet data are fed as input to the proposed method, and the recommendation system suggests an institute to the prospective student based on the lead obtained.

Figure 1. Block Diagram of QNNCL-R Method
Figure 1 illustrates the block diagram of the QNNCL-R method, which is split into three sections. With the distance learning Twitter dataset provided as input, first, preprocessing is performed using a preprocessing library to tokenize the tweets. Second, the relevant tweets are selected by applying Quantile Normalized Chi-square Feature selection to the preprocessed features and tweets. Third, lead generation is performed by Neighbor Combinatorial Machine Learning-based Recommendation.
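The first stage of the pipeline can be illustrated with a minimal tokenization sketch. The paper does not name the preprocessing library or its cleaning rules, so the function `tokenize_tweet` and the three regular expressions below are assumptions made only for illustration, using the Python standard library.

```python
import re

# Hypothetical cleaning rules; the paper only states that a preprocessing
# library tokenizes the tweets and does not list the exact steps.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize_tweet(text):
    """Lowercase a tweet, drop URLs and @mentions, and split it into word tokens."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text)


tokens = tokenize_tweet("Loving #DistanceLearning with @MyUni! https://t.co/abc")
print(tokens)  # ['loving', 'distancelearning', 'with']
```

Hashtags are kept as plain words here because they often carry the topic of the tweet; a different rule set would be equally compatible with the description in the text.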

3.2 Quantile Normalized Chi-Square Feature Selection
Feature selection refers to the process of identifying a subset of useful event labels and discarding irrelevant event labels (i.e., irrelevant tweets). The feature selection process supports the accuracy of lead generation for education services and helps eliminate spurious correlation in the data (i.e., tweets) that might reduce the lead generation accuracy. With an apt selection of event labels, insignificant variables are removed, thereby enhancing the accuracy and classification performance of lead generation for educational services. In our work, a Chi-square Feature selection model based on the filter approach is applied; it employs a statistical function to assign a goodness score to each feature (i.e., tweet). The tweets are ranked according to this score and then either eliminated from the data or retained.
From (1), the subscript $df$ of the chi-squared statistic $\chi^2_{df}$ denotes the degrees of freedom, and $O_i$ and $E_i$ refer to the observed and expected preprocessed tweets respectively. The degrees of freedom are obtained from the number of levels $m$ of one feature (i.e., tweet) and the number of levels $n$ of the other feature:
$$df = (m - 1)(n - 1) \quad (2)$$
After obtaining the chi-square statistic, the p-value is identified and, based on the resultant p-value, the null hypothesis is either accepted or rejected.
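As a minimal sketch of the chi-squared test of independence described above, the statistic and its degrees of freedom can be computed from a contingency table of counts. The function name `chi_square_statistic` and the example counts are hypothetical; the sketch assumes every row and column total is nonzero.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table.

    `observed` is a list of rows of counts. Row and column totals are assumed
    to be nonzero so that every expected count is positive.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under the independence hypothesis.
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = token present / absent,
# columns = relevant / irrelevant tweets.
table = [[30, 10], [20, 40]]
stat, dof = chi_square_statistic(table)
print(round(stat, 4), dof)  # 16.6667 1
```

A larger statistic at the same degrees of freedom means stronger evidence against independence, so the corresponding feature is more likely to be retained.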
To obtain a measure more reliable than the p-value of the chi-squared test, we propose a new measure that adds a quantile term to the p-value of a feature (i.e., tweet), referred to as the quantile normalized p-value $p_{qn}$. Given a set of arrays in a matrix $X$, each column of $X$ is sorted to give $X_{sort}$. The mean across each row of $X_{sort}$ is evaluated and assigned to every element in that row to obtain a matrix $X'_{sort}$. Finally, the normalized values are obtained by rearranging the entries in each column of $X'_{sort}$ to have the same ordering as in the original matrix $X$. The proposed statistical measure $p_{qn}$ is formulated as below.
From (3), $\alpha$ refers to the significance level and $df$ to the degrees of freedom. By using this quantity, we normalize over features (i.e., tweets) with higher cardinality. Simply put, we measure by what percentage the critical value corresponding to $p$, $X'(p, df)$, exceeds the critical value $X'(\alpha, df)$ at a given significance level $\alpha$.
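The quantile normalization steps described above (sort each column, average across rows, restore the original ordering) can be sketched as follows. The function names are hypothetical, ties are broken by row order rather than averaged, and `relative_excess` is only one possible reading of the verbal description of the normalized measure, since the source equation is not reproduced in the text.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns of a matrix given as a list of equal-length rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    # For each column, the row indices in ascending order of value
    # (ties keep their original row order).
    orders = [sorted(range(n_rows), key=lambda i, c=col: c[i]) for col in columns]
    sorted_cols = [[col[i] for i in order] for col, order in zip(columns, orders)]
    # Mean across each row of the column-sorted matrix.
    rank_means = [sum(sc[r] for sc in sorted_cols) / n_cols for r in range(n_rows)]
    # Write the rank means back in the original row order of every column.
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j, order in enumerate(orders):
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


def relative_excess(crit_p, crit_alpha):
    """ASSUMPTION: fractional amount by which X'(p, df) exceeds X'(alpha, df)."""
    return (crit_p - crit_alpha) / crit_alpha


normalized = quantile_normalize([[5, 4], [2, 1], [3, 6]])
print(normalized)  # [[5.5, 3.5], [1.5, 1.5], [3.5, 5.5]]
```

After normalization every column holds the same set of values, so features with different cardinalities are compared on a common scale.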

Neighbor Combinatorial Machine Learning-based Recommendation Model
With the computationally efficient tweets selected, the next step in our work forms a Neighbor Combinatorial Machine Learning-based Recommendation system that provides personalized information by learning user preferences. In this model, we insert a Neighbor Combinatorial function into the traditional Machine Learning-based Recommendation system. The model is called Neighbor Combinatorial because it uses both the single common tweets and the sets of shared tweets based on the neighboring users. The Neighbor Combinatorial function in the Machine Learning-based Recommendation system is used to extract those users who have given similar scores to similar items (i.e., tweets), by means of the Mahalanobis distance. Figure 3 shows the block diagram of the Neighbor Combinatorial Machine Learning-based Recommendation model.

Figure 3. Block Diagram of Neighbor Combinatorial Machine Learning-based Recommendation Model
Assume that there are two users $u$ and $v$; then the distance between the two users with similar scores for similar tweets is formulated as below.
From (4), the distance between two users is estimated based on the variances $\sigma_u^2$ and $\sigma_v^2$, the covariance between the two users $\rho\sigma_u\sigma_v$ and the determinant of the quantile normalized value, respectively. With this, the distance between two users with similar scores for similar tweets using the Mahalanobis distance is expressed as given below.
From (5), the distance between the single common tweets of two users is obtained and, in a similar manner, the distance between two users $u$ and $v$ sharing a set of tweets, or the polarity rate, is mathematically expressed as below.
From the above equation (6), $r_{u,i}$ is the score of tweet $i$ given by user $u$, $r'_u$ is the mean of the scores given by user $u$ to all tweets, and $I_{uv}$ corresponds to the set of tweets co-rated by both users $u$ and $v$. Finally, with the similarity distance estimated, the predicted score used to obtain the lead is computed as in (7). From (7), the predicted score for lead generation, on which the recommendation of certain tweets is based, is estimated from the neighboring users $v$ who have scored tweet $i$. Applying this recommendation is expected to ensure good lead generation accuracy for admission branding.
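Since equations (4) to (7) are not reproduced in the text, the following is only a sketch under stated assumptions: the Mahalanobis distance is taken over a 2x2 covariance matrix built from the two variances and the covariance, the set-level measure is taken to be the standard mean-centered (Pearson-style) similarity over co-rated tweets, and the prediction is the standard neighbor-weighted average used in user-based collaborative filtering. All function names and the example scores are hypothetical.

```python
import math


def mahalanobis_2d(x, y, var_x, var_y, cov_xy):
    """Mahalanobis distance of a centered point (x, y) under a 2x2 covariance matrix."""
    det = var_x * var_y - cov_xy ** 2  # determinant of the covariance matrix
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    d2 = (var_y * x * x - 2 * cov_xy * x * y + var_x * y * y) / det
    return math.sqrt(d2)


def user_mean(scores):
    """Mean of all scores given by one user."""
    return sum(scores.values()) / len(scores)


def pearson_similarity(scores_u, scores_v):
    """Mean-centered similarity over the tweets co-rated by both users."""
    common = set(scores_u) & set(scores_v)
    if not common:
        return 0.0
    mean_u, mean_v = user_mean(scores_u), user_mean(scores_v)
    num = sum((scores_u[i] - mean_u) * (scores_v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((scores_u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((scores_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)


def predict_score(target, tweet, all_scores):
    """Predict the target user's score for a tweet from neighbors who scored it."""
    mean_t = user_mean(all_scores[target])
    num = den = 0.0
    for user, scores in all_scores.items():
        if user == target or tweet not in scores:
            continue
        sim = pearson_similarity(all_scores[target], scores)
        num += sim * (scores[tweet] - user_mean(scores))
        den += abs(sim)
    return mean_t if den == 0 else mean_t + num / den


# Hypothetical scores: user "u" has not yet scored tweet "t4".
all_scores = {
    "u": {"t1": 1, "t2": 2, "t3": 3},
    "v": {"t1": 2, "t2": 4, "t3": 6, "t4": 8},
}
print(round(predict_score("u", "t4", all_scores), 6))  # 5.0
```

With an identity covariance matrix the Mahalanobis distance reduces to the Euclidean distance, which gives a quick sanity check: `mahalanobis_2d(3, 4, 1, 1, 0)` evaluates to 5.0.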

Experimental Settings
In this section, the QNNCL-R method is used to select higher education institutes promptly, thereby laying the ground for accurate admission branding based on sentiment analysis of tweets about distance learning. First, tweets collected from the distance learning dataset are tokenized, after which the relevant tweets for lead generation are selected in a computationally efficient manner. Then, the historical data of scores for each user's tweets are entered into the recommendation system as input. Finally, the subjectivity of each tweet is classified as positive, negative or neutral. Experimental evaluations are performed in Python using the distance learning dataset (https://github.com/Bhasfe/distance_learning), which consists of three CSV files, i.e., raw, processed and sentiment files. Tables 1, 2 and 3 list the features in the raw, processed and sentiment datasets respectively. A comparative analysis of lead generation methods is performed between QNNCL-R, (Arasu et al., 2020) and (Pappu, 2019). The analysis is made with different metrics with respect to tweet size.

Case 1: Processing Time
A significant amount of time is involved in analyzing students' tweets to arrive at lead generation, because tweets are generated within fractions of a second and analyzing them is a time-consuming process. Figure 4 shows the processing time involved in obtaining the words in the given tweets. From the figure it is inferred that increasing the user count increases the number of words in the tweets and, in turn, the processing time. With '500' user counts involved in generating the lead, and the time consumed in obtaining the words for a single user count being '0.155' using QNNCL-R, '0.180' using (Arasu et al., 2020) and '0.205' using (Pappu, 2019), the overall processing time was observed to be '77.5', '90' and '102.5' respectively. From this result, the processing time using the QNNCL-R method is lower than that of (Arasu et al., 2020) and (Pappu, 2019). The lower processing time is owing to the application of the Quantile Normalized Chi-square Feature selection model. With this model, the tweets are first preprocessed (tokenized), and computationally efficient tweets are then selected via the quantile normalization function, where normalization is performed over features with high cardinality, thereby reducing the processing time using QNNCL-R by 17% compared to (Arasu et al., 2020) and 30% compared to (Pappu, 2019).
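The figures quoted above are consistent with the overall processing time being the product of the user count and the time spent per user count. A short check, with the function name `processing_time` chosen only for illustration and the time unit left unspecified as in the text:

```python
def processing_time(user_count, time_per_user):
    """Overall processing time = number of user counts x time per user count."""
    return user_count * time_per_user


per_user = {"QNNCL-R": 0.155, "Arasu et al., 2020": 0.180, "Pappu, 2019": 0.205}
for method, t in per_user.items():
    # Rounded to one decimal to hide floating-point noise.
    print(method, round(processing_time(500, t), 1))
```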

Case 2: Lead Generation Accuracy
Accurate data are required for better lead generation. The dataset used for experimentation is likely to contain a large number of user tweets, some of which may not be accurate. Improper tweets reduce the lead generation accuracy. Therefore, accurate tweets are required to improve lead generation accuracy, which is measured as below.

Figure 5. Graphical Representation of Lead Generation Accuracy
Figure 5 shows the lead generation accuracy in the educational services domain. From the figure, the lead generation accuracy is inversely proportional to the tweet size. In other words, increasing the size of the tweets increases the time consumed in processing them for recommendation, which in turn reduces the lead generation accuracy. With '250' tweets taken for simulation and '235' tweets correctly recommended using QNNCL-R, '220' using (Arasu et al., 2020) and '210' using (Pappu, 2019), the overall lead generation accuracy was '94%', '88%' and '84%' respectively. From the results, lead generation accuracy is better using QNNCL-R. The improvement is due to the application of the Neighbor Combinatorial Machine Learning Lead Generation algorithm, which integrates the Neighbor Combinatorial function with Machine Learning recommendation. With this combination, multiple co-rated tweets are first estimated and then, based on the results, Machine Learning recommendation is applied with the aid of a distance measure to obtain similar tweets. By combining these two functions, accurate and precise lead generation is obtained using QNNCL-R, improving lead generation accuracy by 5% and 10% compared to (Arasu et al., 2020) and (Pappu, 2019) respectively.
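The accuracy values quoted above follow from taking the share of correctly recommended tweets, in percent. A short check, with the function name `lead_generation_accuracy` chosen only for illustration:

```python
def lead_generation_accuracy(correct, total):
    """Percentage of tweets that were correctly recommended."""
    return 100.0 * correct / total


for method, correct in [("QNNCL-R", 235), ("Arasu et al., 2020", 220), ("Pappu, 2019", 210)]:
    print(method, lead_generation_accuracy(correct, 250))
# QNNCL-R 94.0
# Arasu et al., 2020 88.0
# Pappu, 2019 84.0
```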

Case 3: False Positive Rate
False positive rate (FPR) refers to the proportion of negative cases that nevertheless yield a positive test result, in other words, the conditional probability of a positive test result given that the event was not present. To be specific, here it refers to the ratio of user counts wrongly recommended to the total user counts considered, and is measured as given below. Increasing the user count increases the FPR. With '500' user counts considered for simulation and '25' user counts wrongly recommended using QNNCL-R, '40' using (Arasu et al., 2020) and '55' using (Pappu, 2019), the FPR was observed to be '5%', '8%' and '11%' respectively. From this analysis, the FPR is lower using QNNCL-R. The reason for the improvement is the Quantile Normalized Chi-square Feature selection algorithm, in which a statistical test measure between every feature variable and the label variable is evaluated, the existence of a relationship is analyzed, and independent feature variables are eliminated. The FPR of QNNCL-R was reduced by 19% and 43% compared to (Arasu et al., 2020) and (Pappu, 2019) respectively.
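The FPR values quoted above follow from dividing the wrongly recommended user counts by the total user counts and expressing the result in percent. A short check, with the function name `false_positive_rate` chosen only for illustration:

```python
def false_positive_rate(wrongly_recommended, total_users):
    """Percentage of user counts that were wrongly recommended."""
    return 100.0 * wrongly_recommended / total_users


for method, wrong in [("QNNCL-R", 25), ("Arasu et al., 2020", 40), ("Pappu, 2019", 55)]:
    print(method, false_positive_rate(wrong, 500))
# QNNCL-R 5.0
# Arasu et al., 2020 8.0
# Pappu, 2019 11.0
```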

Conclusion
Lead generation via sentiments expressed in Twitter messages is an important though demanding activity. Most current lead generation methods based on sentiment analysis only identify the textual details of tweets and cannot achieve sufficient performance owing to the distinguishing characteristics of information from Twitter. Inspired by recent work on machine learning, and to attain better performance of Twitter sentiment analysis for the education domain, a method called QNNCL-R is proposed. First, computationally efficient features are selected by tokenizing the tweets using a library function and applying the Quantile Normalized function to obtain relevant features. With the relevant features, the Neighbor Combinatorial Learning-based Recommendation algorithm, which combines the Neighbor Combinatorial function with learning-based recommendation, improves lead generation accuracy with minimum processing time and FPR.
The Quantile Normalized Chi-square Feature selection model is applied to the preprocessed tweets, thereby minimizing the training time and avoiding the curse of dimensionality in lead generation. The Quantile Normalized Distance_L

Figure 2. Block Diagram of Quantile Normalized Chi-Square Feature Selection Model
Figure 2 illustrates the block diagram of the Quantile Normalized Chi-square Feature selection model. To start with, a statistical filtering model based on the chi-squared test is applied to test the independence of two events (i.e., preprocessed tweets), where two events $A$ and $B$ are defined to be independent if $P(AB) = P(A)P(B)$ or, equivalently, $P(A|B) = P(A)$ and $P(B|A) = P(B)$. The mathematical formulation is given below.
$$\chi^2_{df} = \sum_i \frac{(O_i - E_i)^2}{E_i} \quad (1)$$

Figure 4. Graphical Representation of Processing Time

Table 4. False Positive Rate Performance Levels using QNNCL-R, ML-SMM (Arasu et al., 2020) and Social Media Content Marketing (Pappu, 2019)
The false positive rate $FPR$ is measured based on the total user counts $U$ and the number of user counts wrongly recommended $U_{WR}$:
$$FPR = \frac{U_{WR}}{U} \times 100$$
It is measured in percentage. The resulting false positive rate is shown in Table 4.