Memorial University - Electronic Theses and Dissertations 4
Title: A bottom-up approach for XML document classification
Author: Wu, Junwei.
Description: Thesis (M.Sc.)--Memorial University of Newfoundland, 2009. Computer Science
Pagination: viii, 64 leaves : ill.
Subject: Data mining; XML (Document markup language)--Classification
Degree Grantor: Memorial University of Newfoundland. Dept. of Computer Science
Discipline: Computer Science
Notes: Includes bibliographical references (leaves 61-64)
Abstract: Extensible Markup Language (XML) is a simple and flexible text format derived from Standard Generalized Markup Language (SGML) [1]. It has been widely accepted as a crucial component of many information retrieval applications, such as XML databases and web services. One reason for its wide acceptance is that its format can be customized for data transmission or storage. Classification is an important data mining task that aims to assign unknown objects to the classes that best characterize them. In this thesis, we propose a method to classify XML documents under the assumption that they do not share a common schema and that a schema may or may not be available, which is closer to real-world cases. Our method is similarity-based. Its main characteristic is the way it handles the roles played by the text and the structural information. Unlike most existing methods, we use a bottom-up approach: we start from the text and then embed the structural information. This is based on the observation that, in XML documents with diverse tag structures, the most informative content is carried by the terms in the text. Our experiments show that this strategy can achieve better performance than existing methods on documents from sources that exhibit heterogeneous structures.
Resource Type: Electronic thesis or dissertation
Format: Image/jpeg; Application/pdf
Source: Paper copy kept in the Centre for Newfoundland Studies, Memorial University Libraries
Local Identifier: a3243873
Rights: The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Collection: Electronic Theses and Dissertations
Scanning Status: Completed
PDF File: 8.26 MB
CONTENTdm file name: 41293.cpd