a data warehouse-oriented methodology for qualitative semi-structured web information and social networking sites' user status search

kabir, md. shahriar

view/open

kabirmds2019-1a.pdf (19.07mb)

date

2019

abstract

finding the most desired and useful information among the diverse content embedded in webpages has become more challenging due to the rapid growth in the number of websites and webpages, the dynamic changes and updates to their content, the lack of well-formed structure in webpage content, and so on. information search is not a trivial task when the plain text, hyperlink text, embedded images, and videos that make up webpage content remain in semi-structured form. semi-structured webpage content does not have a predefined structure and remains in the hierarchically nested html tags of a webpage body. unlike structured webpage content, heterogeneous semi-structured webpage content cannot be neatly formatted, organized, and modeled directly into a relational database. one of the most important information types on the web is the emotion that web users express in statuses posted on social networking sites such as facebook and twitter. a publicly posted user status is informative enough to reveal the user’s daily thoughts, feelings, and emotions through textual self-description. the data warehouse-oriented methodology for extracting semi-structured webpage content and modeling it into a database introduces a simplified, less labor-intensive, xml-based extraction technique. this technique overcomes the limitations of existing pre-defined specification file and wrapper-based techniques in adapting to rapid changes of webpage content and in extracting the same piece of information from different webpages with differently nested html structures. the methodology also introduces a multidimensional fact data modeling technique for storing semi-structured webpage content in a relational database. our implemented methodology ensures qualitative search results, in that hyperlinks to the most desired webpages appear first, and returns them within a small fraction of a minute. [...]

uri

https://knowledgecommons.lakeheadu.ca/handle/2453/5185

collections

electronic theses and dissertations from 2009 [1627]