Web Mining

I. Introduction

The Internet is expanding at an exponential rate. As the digital divide narrows, more individuals log on each day, and businesses and consumers alike are experiencing the overwhelming phenomenon of information overload [Pierrakos, 2010]. E-Commerce is growing as more consumers choose the World Wide Web as their new shopping medium. Companies need to discover knowledge from this information, and make sense of it, in order to maximize profit and customize what information is delivered to whom.

Basic techniques for information discovery on the user end include web directories such as Yahoo (www.yahoo.com), Google (directory.google.com), or ODP (www.dmoz.org). This form of information discovery requires human effort to maintain a list of web sites relevant to the user’s needs. Other user-end information discovery methods are search engines such as Google (www.google.com) and Microsoft’s Bing (www.Bing.com), which require the user to consciously search for content in their area of interest. Recent advancements in web mining technology have paved the way for advanced advertising programs and suggestive search. Both of these technologies utilize preferences derived from past user activity or from profiles built by the user.

This paper gives a general overview of web mining and provides further information on its three categories. This type of knowledge discovery has been a major attraction for the corporate and marketing world, and for E-Commerce in particular. The goal of this research is to gather information on user preferences and activities and to derive knowledge from that information in order to deliver content effectively through personalization.

This paper is a literature review of several papers dealing with the general topic of web mining. It also presents literature research findings on the three categories of web mining in order to provide a more in-depth understanding of the topic as a whole. Table 1 lists the key literature highlighted in this paper, by topic and author, for quick reference to the materials. The selected references were found through literature databases such as IEEE Xplore, EBSCOhost, CiteSeerX, and Google Scholar. A few additional sources were found from a generational research method.

Table 1: Literature Findings by Category
Web Structure Mining | Web Usage Mining | Web Content Mining
Kolari and Joshi, 2004 | Nina et al., 2009 | Nair, 2009
- | Pierrakos, 2010 | Pol et al., 2008

The rest of this paper is structured as follows. Section two provides a general overview of web mining and then covers its three main categories in further detail. Section three is dedicated to future research that could be conducted in the field. Finally, section four concludes the paper.

II. Web Mining

Web mining is the set of data mining techniques applied to extract useful knowledge and implicit information from web data [Nina et al., 2009]. Because of the type of information it provides, this research has attracted corporate entities as well as consumers interested in the rising privacy concerns it raises. The exponential increase in the information made available by the Internet has made research on web mining a necessity.

Current and past research on the topic of web mining can be broken down into three categories:

I. Web Structure Mining

II. Web Content Mining

III. Web Usage Mining

These categories each require their own specific techniques in order to extract and make sense of information. Each category will be discussed in detail in the following sections. It is important to understand that the categories of web mining often overlap depending on what the implementer wishes to mine.

The internet is filled with structured, semi-structured, and unstructured data (see Table 2). This lack of a common schema makes extracting useful information a daunting task. Technologies such as databases, information retrieval, artificial intelligence, machine learning, natural language processing, network analysis, and information integration are needed in order to make sense of the chaos [Nair, 2009].

Table 2: Structured, Unstructured, and Semi-Structured data
Structured Data | Data that is found in a table, list or tree. This type of data is common in product or price lists.
Semi-Structured Data | Data that is found in a web page. Contains various syntaxes such as HTML.
Unstructured Data | This type of data is commonly a text document, or multimedia file such as images, videos, or audio files.

The following sections are breakdowns of web usage mining, web structure mining, and web content mining.

Web Usage Mining

Web usage mining is a discovery method used to derive information about users. This category of web mining provides useful information about users’ behavior and usage patterns, which has become of increasing interest to E-Commerce companies. The ability to extract information about what the user prefers has paved the way for a new form of marketing and personalization in the web experience. By keeping track of the user’s interaction with the web in general, or with a specific site, companies can profile the user’s likes and dislikes and direct content toward those preferences. This makes it possible to direct marketing campaigns toward those who are more receptive, thereby increasing the probability of a sale or of a service being selected.

Web usage mining has attracted many researchers interested in dissecting the internet activity of users, and a vast amount of information is available on every technical aspect of this category. This paper, however, keeps to a basic view in order to provide a general understanding of the topic rather than an in-depth technical one.

In the paper entitled “Pattern Discovery of Web Usage Mining,” Nina et al. (2009) describe how usage information is retrieved and processed into a final product. Usage information comes from three main sources: web servers, proxy servers, and clients [Nina et al., 2009]. Each source provides valuable information from which knowledge of internet activity can be derived. On the server side, web servers provide detailed logs which often include the IP address, date, and time of the request. Proxy servers generally record the same information as the web server, but they additionally capture all of the connections made, on a more global scale. The client side can consist of applications, often known as spyware, or modified browsers such as Google Chrome, which track information and send it back.
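To make the server-side source concrete, the following is a minimal sketch of reading the IP address, date, and time out of one web server log entry. It assumes the NCSA Common Log Format, which the reviewed paper does not specify; the sample line, the pattern name `CLF_PATTERN`, and the function `parse_log_line` are hypothetical and serve only as an illustration.

```python
import re
from datetime import datetime

# Assumed layout: NCSA Common Log Format (the source does not name a log format).
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the bracketed timestamp into a timezone-aware datetime.
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    record["status"] = int(record["status"])
    return record

# Hypothetical sample entry (documentation-range IP address).
sample = ('192.0.2.10 - - [10/Oct/2009:13:55:36 -0700] '
          '"GET /products/index.html HTTP/1.0" 200 2326')
entry = parse_log_line(sample)
print(entry["ip"], entry["timestamp"].isoformat(), entry["path"])
```

Lines that do not match the assumed format return `None`, so a caller can count or discard them during data preparation rather than fail on them.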

Once the initial data is collected it must undergo three phases in order to become understandable information. These phases are:

I. Data Preparation

II. Pattern Discovering

III. Pattern Analysis and Visualization

[Nina et al., 2009] Each of these phases uses specific algorithms in order to clean the data, find specific patterns within the data, and present the data in a readable view. Graphs are often used to show general as well as specific information, which may even include a geographical representation of site visitors.
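The three phases can be sketched end to end on a toy data set. This is an illustrative outline only, not the algorithms used by Nina et al.: the records are invented, the list of static file suffixes and the 30-minute session timeout are assumptions, and counting page-to-page transitions stands in for the more elaborate pattern discovery methods found in the literature.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-parsed log records: (ip, timestamp, requested path).
records = [
    ("192.0.2.10", datetime(2009, 10, 10, 13, 0), "/index.html"),
    ("192.0.2.10", datetime(2009, 10, 10, 13, 1), "/logo.png"),
    ("192.0.2.10", datetime(2009, 10, 10, 13, 2), "/products.html"),
    ("192.0.2.10", datetime(2009, 10, 10, 15, 0), "/index.html"),
    ("192.0.2.10", datetime(2009, 10, 10, 15, 3), "/products.html"),
    ("192.0.2.44", datetime(2009, 10, 10, 13, 5), "/index.html"),
    ("192.0.2.44", datetime(2009, 10, 10, 13, 6), "/contact.html"),
]

STATIC_SUFFIXES = (".png", ".gif", ".jpg", ".css", ".js")  # assumed to be noise
SESSION_TIMEOUT = timedelta(minutes=30)                    # assumed inactivity cutoff

def prepare(records):
    """Phase 1: drop static resources and split each visitor's hits into sessions."""
    by_ip = defaultdict(list)
    for ip, ts, path in sorted(records, key=lambda r: (r[0], r[1])):
        if not path.endswith(STATIC_SUFFIXES):
            by_ip[ip].append((ts, path))
    sessions = []
    for hits in by_ip.values():
        current = [hits[0][1]]
        for (prev_ts, _), (ts, path) in zip(hits, hits[1:]):
            if ts - prev_ts > SESSION_TIMEOUT:
                # A long gap closes the current session and starts a new one.
                sessions.append(current)
                current = []
            current.append(path)
        sessions.append(current)
    return sessions

def discover(sessions):
    """Phase 2: count how often one page is followed directly by another."""
    transitions = Counter()
    for pages in sessions:
        transitions.update(zip(pages, pages[1:]))
    return transitions

def analyze(transitions, top=3):
    """Phase 3: rank the most frequent transitions for reporting."""
    return transitions.most_common(top)

sessions = prepare(records)
patterns = discover(sessions)
for (src, dst), count in analyze(patterns):
    print(f"{src} -> {dst}: {count}")
```

Keeping each phase in its own function mirrors the separation described above, so that a different cleaning rule or discovery method can be swapped in without touching the others.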

The papers researched provide a general understanding of the need for web usage mining. They also provide a high-level understanding of the steps needed to transform log file information into useful information that aids in making sound decisions. These decisions can include web design that maximizes the user’s interaction with the web page, as well as personalization features that attract specific groups. After this review, it appears that research into real-time web mining would be helpful in order to provide up-to-the-minute monitoring of web usage. Such research could improve the ability to respond to the dynamically changing habits of website users. It would also make it possible to log trends and compare them with the logs of other days, much like the sales projection services found in Point of Service systems.

Web Content Mining

The web is a constantly growing medium for accessing a variety of information stored in different parts of the world [Pol et al., 2008]. Information such as images, music, video, and text is available almost everywhere. Web content mining is a category of web mining that tries to extract information from the web page itself. This type of web mining, however, is not as easy as web usage mining. Whereas web usage mining focuses on server logs or information gathering programs, web content mining focuses on the vast number of web pages found on the web, which presents various problems. The paper by Liu and Chang presents these issues and discusses the current technology used to achieve web content mining. A summary of the various issues can be found in Table 3.

Table 3: Issues in Web Content Mining
Redundant information displayed in various syntaxes | The Web is dynamic and always changing | The Web is a virtual society containing interactions between people, or automated systems
Data exists of all types including multimedia data, tables and texts | Various types of information can be found on a page including advertisements, navigation panels, etc. | Many web pages provide services that contain input parameters
The Web consists of a majority of Semi-structured information due to HTML | The amount of the information is huge and continues to grow | Topics of information are limitless, wide and diverse

[Liu and Chang, 2004]

The most common form of web content mining deals with structured data (see Table 2 for a definition). This data includes product lists, services, prices, or other types of information that can be used to provide value-added services such as comparative shopping, or to provide keys for meta-search [Liu and Chang, 2004]. Liu and Chang explain that three technologies are currently used to extract structured data. From the most labor intensive to the least, these methods are:

I. Manually written extraction program

II. Wrapper induction or learning

III. Using an automated system such as MDR or RoadRunner
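As an illustration of the first and most labor-intensive option, the sketch below shows what a manually written extraction program might look like for a single, known page layout. The HTML snippet, the class name `TableRowExtractor`, and the assumption that products sit in a plain table with a header row are all hypothetical; wrapper induction and systems such as MDR or RoadRunner aim to learn or infer this kind of rule rather than have it hand-coded.

```python
from html.parser import HTMLParser

class TableRowExtractor(HTMLParser):
    """Hand-written extractor: collect the cell text of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

# Hypothetical price list used only for illustration.
page = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Blue Widget</td><td>$19.99</td></tr>
  <tr><td>Red Widget</td><td>$24.50</td></tr>
</table>
"""

extractor = TableRowExtractor()
extractor.feed(page)
extractor.close()

# The first row is assumed to be the header; the rest are data records.
header, *data_rows = extractor.rows
products = [dict(zip(header, row)) for row in data_rows]
print(products)
```

The rule only works for pages that follow the assumed layout, which is the main reason this approach is considered labor intensive: each new site needs its own program.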

Extracting text from a web site requires much more than extracting information from a structured source. Current research in this type of data extraction is closely related to research in text mining, natural language processing, and information retrieval. Other research includes finding common language patterns as well as question-answering methods. Extracting unstructured or semi-structured text faces many of the challenges listed in Table 3. Specifically, Liu and Chang address the concept of noise. Noise can be defined as those areas of the web page that are redundant, such as navigation bars and advertisements. Filtering this type of information from the web page is necessary in order to extract the core data.
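A very simple form of noise filtering can be sketched as follows. The heuristic used here, treating everything inside elements such as `nav`, `aside`, or `footer` as noise, is an assumption made for illustration and is not the method discussed by Liu and Chang; real pages often mark navigation and advertisements in less predictable ways. The sample page and the names `NOISE_TAGS` and `CoreTextExtractor` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed heuristic: text inside these elements is treated as noise.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class CoreTextExtractor(HTMLParser):
    """Collect visible text that is not nested inside a noise element."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0   # how many noise elements currently enclose the text
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)

def extract_core_text(html):
    """Return the page text with the assumed noise regions removed."""
    parser = CoreTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page used only for illustration.
page = """
<html><body>
  <nav><a href="/">Home</a> <a href="/deals">Deals</a></nav>
  <div><h1>Blue Widget</h1><p>Price: $19.99</p></div>
  <aside>Advertisement: buy more widgets!</aside>
  <footer>Contact us</footer>
</body></html>
"""
core = extract_core_text(page)
print(core)
```

The sketch assumes reasonably well-formed markup; a page with unclosed noise elements would need a more forgiving strategy.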

After researching the topic, it is clear that making sense of web content is an extremely daunting task. However, the benefits of doing so can be found all over today’s web. An example that consumers use every day is the comparison of prices for services or items across several companies at once. Liu and Chang also bring up the importance of this technology for mining opinions from consumers. This type of mining draws on reputable web sites containing forums or other types of reviews in order to develop a clear picture of an organization’s success. This information is used in marketing research to develop reports and analyze word of mouth, which is extremely important for any company’s success.

Web Structure Mining

Web structure mining provides a ranking, or measure of authoritativeness, which enhances search results through filtering derived from the web’s hyperlink structure [Kolari and Joshi, 2004]. Unlike web content mining, which uses a page’s content, web structure mining derives information from the way pages are linked to one another. This type of web mining clusters together a site, the sites it links to, and the sites that link to it. The result is a web of sites which generally share a similar topic or purpose.

Web structure mining is often combined with web content mining in order to provide a new level of precision. This technology is of great interest to web-based search engines, which utilize it to provide greater accuracy. By extracting the direction of web links, web structure mining provides knowledge of which sites are authoritative, based on the links that point from various hubs to the source. This helps assign page rankings, with the highest going to the web sites that have the most links pointing to them. Generally, this type of ranking favors the web site with the most information.

In a paper by Kolari and Joshi, the technologies used for web structure mining are explained. These include the PageRank system, which is utilized by Google’s search engine, the Clever system, and an additional approach which utilizes web communities. Each of these approaches uses web structure mining to rank pages and so determine the most authoritative web sites in a community.

The first technology discussed is the PageRank system. Google uses this system to populate its results with the most relevant web sites first. PageRank relies on a crawler and pre-computes rankings before search results are returned [Kolari and Joshi, 2004]. The crawler returns information about the ranking of the page, which includes various aspects such as popularity and, most relevant to this section, the number of links leading back to the page itself. This is used to estimate whether a user will generally stay at this page or go off to another page to find more information.
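The link-based part of this idea can be sketched with the simplified, textbook form of PageRank computed by power iteration. This is not a description of Google’s production system: the damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical four-page site graph.
graph = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "blog": ["home", "products"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page that every other page links to ends up with the highest score, while the page that nothing links to keeps only the baseline share, which matches the intuition that incoming links signal authority.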

The second technology is the Clever system. This system incorporates not only web structure mining but also web content mining. It takes the query term and compares it to the text in or near the anchor text in the HTML page [Kolari and Joshi, 2004]. A site therefore gains rank not only from the links to other sites or to the site itself, but also from how well the site’s genre matches the query. This provides a higher level of accuracy.
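The Clever project is generally described as building on the hubs-and-authorities (HITS) iteration. The sketch below covers only that link-based part, under stated assumptions: it omits the comparison of the query with anchor text described above, the graph is invented, and the function name `hits` and the iteration count are hypothetical choices.

```python
import math

def hits(links, iterations=20):
    """Basic hubs-and-authorities iteration over a link graph.

    links maps each page to the list of pages it links to.
    Returns (hub_scores, authority_scores).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        # Normalize both score vectors so the values stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {page: v / auth_norm for page, v in auth.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}
    return hub, auth

# Hypothetical graph: three directory pages (hubs) pointing at three stores.
web = {
    "directory_a": ["store_x", "store_y"],
    "directory_b": ["store_x", "store_y", "store_z"],
    "directory_c": ["store_x"],
}
hub_scores, auth_scores = hits(web)
print("best authority:", max(auth_scores, key=auth_scores.get))
print("best hub:", max(hub_scores, key=hub_scores.get))
```

A content-aware variant along the lines described above would weight each link by how well the text around it matches the query, instead of counting every link equally.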

The third technology, or method, is based on the idea of web communities. Web communities are web sites, often similar in interest, which are linked to each other. Kolari and Joshi describe web communities as web rings, a term coined because the sites are essentially connected in a big circle. When mining a web community, the interlinked web sites are analyzed to find sites that simply do not link back. These sites are then considered the community core, or the authoritative web sites.

All three of these methods of web structure mining provide results based on a page’s authoritative value. Each of these techniques is valid and provides a high level of accuracy when trying to find the most relevant information.

III. Future Research

Constant change in markets requires a business to adapt quickly in order to compete in the free-market world. Being the best requires research to find out what a consumer’s wants, needs, and dislikes are. With the internet and its vast number of users growing, it is necessary for web mining research to be conducted. Customer relations, sales, and marketing departments should utilize current web mining techniques in order to gain a fuller perspective of the customer.

Future research in this field should be conducted in both web content mining and web usage mining. It is important not only to gain knowledge of a consumer’s internet usage but also to find out what their opinions are. Research utilizing both content and usage mining would help the industry paint a complete picture of the consumer.

Traditional consumer research was done with a questionnaire or survey. This type of research captures a very limited selection of opinions, which narrows or distorts its findings. Using new web mining and artificial intelligence techniques, it could be possible to sample a much broader range of customer opinions from existing forums and not just a selected few sites.

Combining research of these forums with an analysis of consumer browsing behavior would provide a remarkable tool for business customer relations, sales, and marketing. This type of software incorporated with various decision support systems would allow for far more current information to be made readily available for upper management to formulate plans and campaigns.

Figure 1: Areas in Web Mining. [Nair, 2009]

The internet and web mining are still in their infancy. The technologies needed to advance web mining must also be researched. Figure 1 shows how web mining depends on other technologies in order to evolve. Research in these other technologies will help web mining become a major source of information used to make decisions.

IV. Conclusion

In this paper, a general overview of web mining was presented, along with some of the recent research that has been done. Web mining was broken down into its three major categories of study, and each was analyzed. Specific research papers were organized per section and discussed according to their topic. Each of these research papers introduced a new aspect of web mining, furthering the general knowledge of the topic.

As technology grows, so will the internet and the web. Businesses, whether internet based or not, will find themselves researching this vast landscape of knowledge in order to get ahead of the competition. The topic of web mining discussed in this paper will continue to be of great interest to companies trying to gain an advantage.

The ability to increase personalization when browsing, to optimize web site development for maximum usage, to find information faster, and to find links between vast numbers of web sites are just a few of the advantages of web mining. Today, more than ever, we see companies marketing products to those who are likely to buy. Web sites are optimized with link patterns to get the most exposure from each visitor. Each of these examples is a direct result of web mining.

As more research is done in all of web mining’s supporting fields, the power of web mining will grow. Section three gives the reader an idea of where this research may end up in the future. As this field becomes more attractive to the business world, more research will be done. The future of web mining will be filled with powerful research tools that will become standard for any business.

References:

Xiaohua, H., Cercone, N. (2004). “A data warehouse/online analytic processing framework for web usage mining and business intelligence reporting.” International Journal of Intelligent Systems, 19(7), 585-606.

Weihui, D., Xingyun, D., & Tao, S. (2009). “A Smart Targeting System for Online Advertising.” Journal of Computers, 4(8), 778-786.

Nair, N. (2009). “Mining the Web for Managing Web Content.” IUP Journal of Systems Management, 7(3), 7-16.

Zhang, Q., Segall, R. (2008). “Web Mining: A Survey of Current Research, Techniques, and Software.” International Journal of Information Technology & Decision Making, 7(4), 683-720.

Nina, S.P., Rahman, M., Bhuiyan, K.I., Ahmed, K. (2009) “Pattern Discovery of Web Usage Mining,” Computer Technology and Development ICCTD ’09. vol.1, no., pp.499-503, 13-15

Nasraoui, O., (2009). “A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites.” IEEE Transactions on Knowledge and Data Engineering, Vol. 20, 2

Kosala, R., Blockeel, H. (2000). “Web Mining research: A Survey”, SIGKDD Explore 2

Tyagi, N., Solanki, A., & Wadhwa, M. (2010). “Analysis of Server Log by Web Usage Mining for Website Improvement.” International Journal of Computer Science Issues (IJCSI), 7(4), 17-21.

Qiu, L., Li, Y., & Wu, X. (2008). “Protecting business intelligence and customer privacy while outsourcing data mining tasks.” Knowledge & Information Systems, 17(1), 99-120. doi:10.1007/s10115-007-0113-3.

Pierrakos, D., (2010). “Personalizing Web Directories with the Aid of Web Usage Data.” IEEE Transactions on Knowledge and Data Engineering, Vol.22, 9

Liu. B., Grossman, R., & Zhai. Y., (2004). “Mining Web Pages for Data Records.” University of Illinois at Chicago, Published by IEEE

Pol, K., Patil, N., Patankar, S., Das, C., (2008). “A Survey on Web Content Mining and Extraction of Structured and Semistructured Data,” Emerging Trends in Engineering and Technology, 2008. ICETET ’08. pp.543-546, 16-18

Kao, H., Lin, S., Ho, J., Chen, M., (2004). “Mining Web informative structures and contents based on entropy analysis,” Knowledge and Data Engineering, IEEE Transactions on , vol.16, no.1, pp. 41- 55

Liu, B., Chang, K., (2004). “Explorations, special issue on Web content mining” SIGKDD, vol. 6, no. 2, pp. 1-4.

Kolari, P.; Joshi, A.; , “Web mining: research and practice,” Computing in Science & Engineering , vol.06, no.4, pp. 49- 53, July-Aug. 2004
