Social Network Extraction Unsupervised

In the era of information technology, two fields are developing side by side: data science and artificial intelligence. In data science, one of the tasks is the extraction of social networks from information sources that have the nature of big data. In artificial intelligence, meanwhile, the presence of contradictory methods has an impact on knowledge. This article describes the unsupervised stream of methods for extracting social networks from information sources. A variety of approaches and strategies are possible, with the superficial method as the starting concept. Each method has its advantages, and in general the methods complement one another by simplifying, enriching, and emphasizing the results.


Introduction
Following the definition of data science [1,2], one of its tasks is disclosing knowledge from big data [3]. Thus, social network extraction (SNE) from information sources is data modeling that transforms the data into knowledge by revealing the existence of social actors and the possible relationships between them [4]. Formally, there is $\gamma: A \times A \to R$, where $A$ is a collection of social actors and $R$ is a collection of relationships. With $\gamma = \langle \gamma_1, \gamma_2 \rangle$ for $\gamma_1$ (1:1)$: A \to V$ and $\gamma_2: R \to E$, the mapping $\gamma: \mathrm{SNE}(A,R) \to G(V,E)$ denotes the extraction of social networks, with $v_i = \gamma_1(a_i)$, where $v_i \in V$ and $a_i \in A$, $i = 1,\dots,N$, and $e_j = \gamma(r_k(a,b)) = \gamma_2(z_a \cap z_b)$, where $e_j \in E$, $j = 1,\dots,M$, and $r_k \in R$ for each $a,b \in A$, $k = 1,\dots,K$. Here $z_a \subset Z$, $z_b \subset Z$, and $z_a \cap z_b \subset Z$, all of which are semantic sets of interpretation clues [5,6].
One of the streams of SNE methods is the unsupervised method [7,8,9], a superficial way of interpretation that produces knowledge about social structures from information sources [10]. It generates interpretations that rely on search engines and on the meaning of query results [11]. This meaning can be enriched through different strategies, which yield different results. It is therefore possible to reveal a variety of social networks, which in turn form different communities from different social structures. This article aims to describe the possible changes of SNE strategy within the unsupervised stream of methods.

Materials and methods
Getting information from big data (or an information source) that is easy to access involves a search engine [12], where big data $\Omega$ is a platform consisting of a collection of data with various characteristics [13]. For a query $q$, the parts of big data, namely $\omega \in \Omega$, reveal the relevance between $q$ and $\omega$. The logical implication $\omega \Rightarrow q$ is considered for $q = t_a$, a query containing the search term $t_a$, where $t_a$ represents the name of a social actor [14]. The meaning of $\omega \Rightarrow t_a$ gives the cumulative count $\sum t_a$ over $\Omega_a$ whenever $t_a$ is true at $\omega \in \Omega$, where $|\Omega|$ is the cardinality of big data and $|\Omega_a|$ is the hit count. $\Omega_a$ is also called the result of clustering the information space based on the information in the query $q = t_a$, where $\Omega_a \subset \Omega$; it produces not only the hit count but also the snippets, which are summaries of the information around the query [15,16].
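The relation between a query term, its cluster $\Omega_a$, and the hit count can be illustrated on a toy information space. The following is a minimal sketch, not a search engine: the documents, the names, and the helper `cluster` are hypothetical, and "the term is true at a document" is assumed to mean simple word membership.

```python
# Toy information space: each string stands for one document (omega) in Omega.
# The documents and actor names are hypothetical placeholders.
OMEGA = [
    "alice wrote a paper with bob",
    "bob gave a talk",
    "alice joined the committee",
    "carol reviewed the paper",
]


def cluster(term, omega=OMEGA):
    """Return Omega_a as the set of document indices where the term occurs."""
    return {i for i, doc in enumerate(omega) if term in doc.split()}


omega_a = cluster("alice")        # occurrence of actor a
omega_b = cluster("bob")          # occurrence of actor b
hits_a = len(omega_a)             # |Omega_a|
hits_b = len(omega_b)             # |Omega_b|
hits_ab = len(omega_a & omega_b)  # |Omega_a ∩ Omega_b|, the co-occurrence
```

A real search engine returns only the hit count and a page of snippets rather than the full set, so the intersection is obtained there by submitting the joint query rather than by intersecting sets.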
Specifically, on the Internet, submitting a query to a search engine produces web snippets (snippets), each containing at least three elements: the URL (uniform resource locator) address of the web page $\omega$, the web page header (title), and the summary of information [17]. The title of the web page is an important piece of information from a summary point of view, while the summary is part of the web page body. A snippet generally contains about 50 words around the query $q$; thus, the number of words in a snippet is $|s| \approx 50$ [18]. The number of snippets around $t_a$ is $|\Omega_a|$, which means that there are $|\Omega_a|$ URL addresses, $|\Omega_a|$ web page titles, and $|\Omega_a|$ web page summaries [19]. In other words, for each $t_a$ with $|\Omega_a| = l$ there is a collection of snippets $L_a = \{S_t \mid t = 1,\dots,l,\ t_a \in S_t\}$. Suppose $w$ is a word, $w \in \omega$; for $|\Omega_a| > 0$, $L_a$ is then a bag of words (BoW) [20]. Technically, every word on a web page has a value, $p(w) = |w|/|s|$, which is the number of occurrences $|w|$ of the same word compared to the total number of words $|s|$ in the snippet. Suppose $|w|_t$ is the number of occurrences of the same word in the $t$-th snippet; then the value of each word $w_h$ in the BoW [21,22,23] is
$$p(w_h) = \frac{1}{l}\sum_{t=1}^{l} \frac{|w|_t}{|s|_t}, \tag{1}$$
where $h = 1,\dots,H$ and $H$ is the number of unique words in the BoW. A normalized value $pr(w)$ is also calculated for each word.
Similarly, for a query $q = t_a,t_b$, that is, a query that forms a cluster of information $\Omega_a \cap \Omega_b$, if the hit count $|\Omega_a \cap \Omega_b| > 0$ then there are one or more snippets about the contents of $q$. The BoW of $\Omega_a \cap \Omega_b$ means that there are values $p(w_u)$, and similarity measures such as the cosine, mutual information, overlap coefficient, Dice coefficient, etc., are generally expressed as $sr = \mathrm{sim}(|\Omega_a|, |\Omega_b|, |\Omega_a \cap \Omega_b|)$ [40,41].
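The word value of Eq. (1) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: Eq. (1) is read as the mean over the $l$ snippets of the relative frequency $|w|_t/|s|_t$, tokenization is plain whitespace splitting, and the function name `word_values` and the sample snippets are hypothetical.

```python
from collections import Counter


def word_values(snippets):
    """Map each word to its mean relative frequency over all snippets."""
    l = len(snippets)
    totals = Counter()
    for snippet in snippets:
        words = snippet.lower().split()  # assumed tokenization
        if not words:
            continue  # an empty snippet contributes nothing
        for word, count in Counter(words).items():
            totals[word] += count / len(words)  # |w|_t / |s|_t
    return {word: total / l for word, total in totals.items()}


# Hypothetical snippets standing in for L_a.
snippets = [
    "social network extraction",
    "social actors in a network of actors",
]
values = word_values(snippets)
```

A word that is frequent in one short snippet and absent from the others still receives a small value, because the sum is divided by the number of snippets rather than by the number of snippets containing the word.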
The strength relation (sr) between a pair of social actors is built by the procedure expressed in Algorithm 1 [42]. Algorithm 1 is named the basic superficial method (BSM), and it reveals the strength of the relation between two social actors. For example, for the social actor $t_a$ = Mahyuddin K. M. Nasution, a query to the Google search engine returns $|\Omega_a| = 121{,}000$ hits, while for the social actor $t_b$ = Marischa Elveny the query returns $|\Omega_b| = 2{,}130{,}000$ hits. Semantically, the search for each social actor's name is declared an occurrence. Every pair of occurrences that produces a relationship between social actors semantically involves a co-occurrence, shown directly by the existence of $|\Omega_a \cap \Omega_b| = 1{,}410$ hits as the result of the query $q = t_a,t_b$ [43,44]. Then sr becomes 0.000627, as shown in Figure 1, a graph of a social network [45].
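The reported value can be reproduced from the three hit counts. The sketch below assumes that the similarity measure has the form of the Jaccard coefficient, $|\Omega_a \cap \Omega_b| / (|\Omega_a| + |\Omega_b| - |\Omega_a \cap \Omega_b|)$; this form is an assumption made here because it matches the stated sr, and the function name `strength_relation` is hypothetical.

```python
def strength_relation(hits_a, hits_b, hits_ab):
    """Jaccard coefficient on hit counts (assumed form of the measure)."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union > 0 else 0.0


# Hit counts reported in the text for the two actors and their co-occurrence.
sr = strength_relation(121_000, 2_130_000, 1_410)
print(round(sr, 6))
```

Because hit counts are estimates returned by the search engine and change over time, the same query submitted later would most likely give a different sr.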

Results and discussion
The automatic implementation of the BSM can be based on the Python programming language, for example, to access information using the Yahoo search engine [46]: page = urlopen("http://search.yahoo.com/search;...nama1,nama2...").read(), where "..." stands for the rest of the query that directs nama1 and nama2 as the information sought. The two variables nama1 and nama2 contain the names of social actors according to the query concept $q = t_a,t_b$, and the hit count, extracted from the metadata in the page returned by the search engine, gives the co-occurrence. Usually, besides the hit count, a returned page contains at most 10 snippets, even when $|\Omega_a \cap \Omega_b|$ is greater than 10 hits [47]. The page consists of URL addresses, headers, and summaries of web pages mixed with tags. Plain text is produced by stripping the tags; the Python programming language is equipped with natural language processing functions that make it easy to generate a BoW from a snippet page. This serves to enrich the social structure [48].
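The tag-stripping step can be sketched with the standard library alone. This is a minimal illustration, not the implementation used in the study: the class `TextExtractor`, the function `strip_tags`, and the sample markup are hypothetical, and no request is sent to any search engine.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a page and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical snippet markup: a linked title followed by a summary.
page = (
    '<div><a href="http://example.org/p">Social <b>network</b> extraction</a>'
    "<p>A summary of the page.</p></div>"
)
text = strip_tags(page)
```

The resulting string can then be split into words to form the BoW. Real result pages also carry scripts and styling, which a fuller version would need to skip.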
A strategy change in the BSM deals with the content of the query submitted to the search engine. Using well-defined social actor names in quotation marks gives a different hit count. For example, $q$ = "Mahyuddin K. M. Nasution","Marischa Elveny" submitted to the Google search engine produces $|\Omega_{\text{"a"}} \cap \Omega_{\text{"b"}}| = 774$, where $|\Omega_{\text{"a"}}| = 6{,}740$ and $|\Omega_{\text{"b"}}| = 2{,}470$. Applying Eq. (3) yields sr = 0.09175. This change in strategy affects the method of extracting social networks from information sources. We call it the pattern superficial method (PSM). It aims to emphasize the strength relation between social actors [18].
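The difference between the two strategies lies only in how the query string is formed, which can be sketched as follows. The helper `build_query` and its `pattern` flag are hypothetical names introduced for illustration; the comma-separated form follows the notation $q = t_a,t_b$ used in the text, and the way a particular search engine expects the query in its URL is not specified here.

```python
from urllib.parse import quote_plus


def build_query(name_a, name_b, pattern=False):
    """Join two actor names into one query; quote each name for the PSM."""
    if pattern:
        name_a, name_b = f'"{name_a}"', f'"{name_b}"'
    return f"{name_a},{name_b}"


bsm_query = build_query("Mahyuddin K. M. Nasution", "Marischa Elveny")
psm_query = build_query("Mahyuddin K. M. Nasution", "Marischa Elveny", pattern=True)

# Percent-encode the query before placing it in a URL.
encoded = quote_plus(psm_query)
```

Quoting asks the search engine for the exact name pattern, which is why the hit counts of the PSM are smaller than those of the BSM for the same pair of actors.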
One of the contents of a snippet is the URL address of the web page where the query content is located. The canonical form of a URL consists of the components $U = \{s,d,p,q\}$ = {scheme, authority, path, query}, which form the string $s://d_m.\,\cdots\,.d_2.d_1/p_1/p_2/\dots/p_{n-2}/x$, with $x = p_{n-1}$ or $x = p_{n-1}?q$ [18]. Thus the URL has $n$ layers, with each part separated by a slash. Each occurrence indirectly presents at most ten URL addresses on one page for $|\Omega_a| > 0$ or $|\Omega_{\text{"a"}}| > 0$, for example. The similarity between the URL addresses of two occurrences can form a relationship between two social actors. The snippets of $q = t_a$ have $|\Omega_a|$ URL addresses, just as the snippets of $q = t_b$ produce $|\Omega_b|$ URL addresses. Between the two sets of URL addresses there may be identical URL addresses, similar URL addresses, or different URL addresses. An identical URL address forms a strong semantic relationship, in which the names of the social actors co-occur on the same page. Similar URL addresses are those whose initial layers are the same while the endings differ; semantically, this reveals the co-occurrence of the social actor names on the same page of a document, but with a three-point barrier "…" between the two names, which implies a different meaning of the relationship [38]. For example, if an author cites the scientific work of another author, the two authors are usually separated by "…". By parsing the URL layers from the dissimilar layers to the shared ones, it is possible to construct a similarity and to involve the measurement of Eq. (3), or to construct a new similarity adapted to the URL address, for example $\mathrm{sim}(a,b) = 2|ab|/(|a|+|b|)$, $\mathrm{sim}(a,b) \in [0,1]$ [49], where $|ab|$ is the cardinality of $a \cap b$, while $|a|$ and $|b|$ are the cardinalities of the vectors of $t_a$ and $t_b$, respectively. This approach constitutes a further change of strategy. We call it the underlying superficial method (USM) [10], which has other variations.
The first variation arises when the query involves a pattern of names, so that the URL addresses in the snippets are in $\Omega_{\text{"a"}}$, for example. The initial variation, or USM, becomes the basic underlying superficial method (bUSM), while the variation involving the pattern becomes the pattern underlying superficial method (pUSM). In contrast, a co-occurrence results in a list of URLs that are the same for the two social actors. Thus, the cardinality of the co-occurrence is proportional to $|\Omega_a \cap \Omega_b|$, which means that there are $|\Omega_a \cap \Omega_b|$ URL addresses on the page. The comparison between $|\Omega_a \cap \Omega_b|$ and the $n$ layers of the URL addresses, namely $n/|\Omega_a \cap \Omega_b|$, applies for $|\Omega_a \cap \Omega_b| > n$, $|\Omega_a \cap \Omega_b| = n$, or $|\Omega_a \cap \Omega_b| < n$. Obtaining the strength relation requires normalization so that $sr \in [0,1]$. This strategy is known as cbUSM, while another variation is cpUSM, obtained by changing the content from $\Omega_a \cap \Omega_b$ to $\Omega_{\text{"a"}} \cap \Omega_{\text{"b"}}$.
For example, one of the URL addresses in the snippet list for $t_a$ is https://publons.com/researcher/2908750/mahyuddin-k-m-nasution/ while one in the snippet list for $t_b$ is https://publons.com/researcher/1730428/marischa-elveny/. Each has four layers, and the two URL addresses are not exactly the same but are similar. The two initial layers are associated with the Publons site and the naming of the researcher community, while the next two layers give the researcher identifier and the researcher name. Both $|a|$ and $|b|$ have the value four, whereas $|ab|$ has the value two, so $\mathrm{sim}(a,b) = 2|ab|/(|a|+|b|) = 2(2)/(4+4) = 4/8 = 0.5$ [10,49]. Accumulating over all URLs in the snippets of either $t_a$ or $t_b$ produces $|a|$ and $|b|$ and, in the same way, the accumulated value of $|ab|$. If the measurement involves the vector values of $|a|$, $|b|$, and $|ab|$, then the approaches bUSM, pUSM, cbUSM, and cpUSM implement different strategies and produce four measures of sr.
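The worked example can be checked with a short Python sketch. It is an illustration under assumptions: a layer is taken to be the authority followed by each non-empty path segment (which gives the four layers counted in the text), $|ab|$ is taken as the number of identical leading layers, and the helper names `url_layers` and `url_similarity` are hypothetical.

```python
from urllib.parse import urlparse


def url_layers(url):
    """Split a URL into layers: the authority, then each non-empty path part."""
    parsed = urlparse(url)
    return [parsed.netloc] + [part for part in parsed.path.split("/") if part]


def url_similarity(url_a, url_b):
    """sim(a,b) = 2|ab| / (|a| + |b|), with |ab| the shared leading layers."""
    a, b = url_layers(url_a), url_layers(url_b)
    common = 0
    for layer_a, layer_b in zip(a, b):
        if layer_a != layer_b:
            break
        common += 1
    return 2 * common / (len(a) + len(b))


url_a = "https://publons.com/researcher/2908750/mahyuddin-k-m-nasution/"
url_b = "https://publons.com/researcher/1730428/marischa-elveny/"
sim = url_similarity(url_a, url_b)
```

Counting only leading layers reflects the description that similar URL addresses share their initial layers and differ in their endings; identical URLs give a similarity of 1 and URLs on different sites give 0.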
The titles and summaries of web pages in the snippets both provide a set of words. The sets come from the occurrence of each author or from the co-occurrence of two authors, where each word in the BoW has a value according to Eq. (2) [5]. The set of words of each social actor based on occurrence produces a vector with $|a|$ or $|b|$ for $\Omega_a$ and $\Omega_b$, respectively, while the set of words shared by the occurrences of a pair of social actors produces a vector with $|ab|$ from $\Omega_a$ and $\Omega_b$, or $|ab|_{2o}$. In contrast, the co-occurrence $\Omega_a \cap \Omega_b$ of a pair of social actors also produces a vector with $|ab|$, or $|ab|_c$. The value of $|ab|_{2o}$ is not always the same as that of $|ab|_c$. There is therefore a variation of the measurement that results in a variation of the strength relation sr: a measurement involving $|a|$, $|b|$, and $|ab|_{2o}$; a measurement with $|a|$, $|b|$, and $|ab|_c$; and a measurement involving only $|ab|_c$, constructed by normalizing $|ab|_c$ based on mean values. As the measurement strategies vary according to the data model, there are various adapted methods, namely the occurrence description superficial method (oDSM), the basic description superficial method (bDSM), and the co-occurrence description superficial method (cDSM). Changing the content of the query so that it involves a well-defined name indirectly gives three further variations, namely the pattern occurrence description superficial method (poDSM), the pattern basic description superficial method (pbDSM), and the pattern co-occurrence description superficial method (pcDSM) [14,18].
Social network extraction methods in the unsupervised stream generally prioritize the means of access to the available information space or big data. Those access tools, for example search engines, involve queries. Access to the information space is also possible through the available log system and through the granting of authorization to the information space or database.
Several access tools provide easy entry to specific information spaces through strategies that involve special formulations [5]. Revealing the relationship between social actors involves, besides obtaining the occurrence and co-occurrence, a measurement whose value lies in [0,1]. Usually this is a similarity, but an average of the measurements is not ruled out. The methods are always accompanied by an evaluation approach that discloses information through surveys and involves measuring recall, precision, or F-measure [44,50]. All the approaches and strategies that make up the methods reveal that there is a simplest method, which the other methods can enrich by involving additional information that explains the formation of a relationship or community. Each relationship then has a confirmation; usually, its importance involves a threshold. Based on Table 1, there is an average formulation,
$$\mu = \frac{1}{6}\sum_{k=1}^{6} sr_k. \tag{4}$$
Eq. (4) integrates the measurement by completing the initial enrichment. Meanwhile,
$$\eta = \frac{1}{6}\sum_{k=7}^{12} sr_k \tag{5}$$
integrates the measurement by asserting measurement and enrichment (Figure 2). The integrated measurement in the unsupervised stream is thus $\mu + \eta$.
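Eqs. (4) and (5) can be sketched as two averages over an ordered list of twelve strength relations. The ordering of the methods follows Table 1 and is not reproduced here, so the input list is assumed to be already in that order; the function name `integrated_measure` and the sample values are hypothetical placeholders, not results.

```python
def integrated_measure(sr_values):
    """Return (mu, eta, mu + eta) for twelve ordered strength relations."""
    if len(sr_values) != 12:
        raise ValueError("expected 12 strength-relation values")
    mu = sum(sr_values[:6]) / 6   # Eq. (4): mean of sr_1 .. sr_6
    eta = sum(sr_values[6:]) / 6  # Eq. (5): mean of sr_7 .. sr_12
    return mu, eta, mu + eta


# Placeholder values for illustration only.
sr_values = [0.1] * 6 + [0.2] * 6
mu, eta, total = integrated_measure(sr_values)
```

Since each sr lies in [0,1], both averages lie in [0,1] and their sum lies in [0,2], so a threshold applied to the integrated measure has to be chosen on that wider scale.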

Conclusions
Social network extraction involves access tools, search engines, queries, hit counts, and similarity measurements, which are generally recognized as superficial methods. By changing the strategy, these methods provide enrichment and confirmation of the measurement results. They thereby become important in the extraction of social networks from information sources. The next task of this research is to reveal the complexity of the social network extraction methods.