GSoC_2016_Progress_Federica
###Abstract
The project focuses on the extraction of relevant but hidden data which lies inside lists in Wikipedia pages. The information is unstructured and thus cannot be easily used to form semantic statements and be integrated in the DBpedia ontology. Hence, the main task consists in creating a tool which can take as input one or more Wikipedia pages containing lists and then construct appropriate mappings to be inserted in a DBpedia dataset. The extractor must prove to work well on a given domain and to have the ability to be expanded to reach generalization.
###Extraction Process, in short
Depending on the input, the Extractor will analyze a single Wikipedia page or all pages about resources from a given DBpedia ontology class. In both cases, each page is parsed using the JSONpedia web service, obtaining a representation of the lists it contains, linked to their section and subsection titles. At this point, the Extractor looks for a mapping suiting the resource class and compares the section titles against a list of keywords, depending on the requested language. If a match is found, the related mapping function is applied to each list element to form semantic triples and build an RDF graph (for example, from the bibliography list of a writer it tries to extract information about his/her works, their literary genre, publication year and ISBN code). Finally, if the graph is not empty, all statements are serialized in a .ttl file.
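To make the triple-construction step concrete, here is a minimal, self-contained sketch (not the project's actual code; the property choices are illustrative) of how a single bibliography item could be turned into RDF statements and serialized with rdflib:

```python
# Illustrative only: turning one bibliography list item into RDF triples and
# serializing them to Turtle, roughly what a mapping function produces.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

DBO = Namespace("http://dbpedia.org/ontology/")
DBR = Namespace("http://dbpedia.org/resource/")

g = Graph()
author = DBR["William_Gibson"]
work = DBR["Neuromancer"]

# "''Neuromancer'' (1984), ISBN 0-441-56956-0" -> statements about the work
g.add((work, RDF.type, DBO.Work))
g.add((work, DBO.author, author))
g.add((work, DBO.isbn, Literal("0-441-56956-0")))
g.add((work, DBO.releaseDate, Literal("1984")))  # property name is an assumption

if len(g) > 0:  # serialize only when something was extracted
    g.serialize(destination="ListExtractor_sample.ttl", format="turtle")
```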
###Achievements
The main concept behind the List Extractor is to use the information we have about the lists in a Wikipedia page in order to select a suitable rule to form RDF statements. This is very important since the list itself, by its very nature, doesn't carry any metadata about the information it expresses. To overcome this obstacle I decided to exploit the information carried by the list's section title and by the type of the resource (obtained by querying a DBpedia SPARQL endpoint). This means that, knowing the resource type and analyzing the section title, I decide on a certain mapping function to be applied to each list element.
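As an illustration of the type lookup, the sketch below queries the public DBpedia SPARQL endpoint with SPARQLWrapper; the query shape is an assumption and the extractor's actual query may differ:

```python
# Sketch of asking the DBpedia endpoint for the ontology classes of a resource.
from SPARQLWrapper import SPARQLWrapper, JSON

def get_resource_classes(resource):
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT DISTINCT ?type WHERE {
            <http://dbpedia.org/resource/%s> rdf:type ?type .
            FILTER(STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
        }""" % resource)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [row["type"]["value"] for row in results["results"]["bindings"]]

# Expect http://dbpedia.org/ontology/Writer among the returned classes
print(get_resource_classes("William_Gibson"))
```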
I have explored the domains of Actors and Writers, both in English and Italian, since I found their lists about filmography and bibliography respectively to be interesting and abundant (before I started coding I did some research about the number of lists in various domains, collaborating with fellow DBpedia GSoCer papalinis on statistics.py, a script useful for both of us and included in Table Extractor). The results have been iteratively refined during these months, and I have provided evaluations obtained by manually analyzing a sample of resources and comparing them with each Wikipedia page, as well as complete datasets that you can find in the repository.
Topic & Language | # Resources | # Statements | Sample Accuracy |
---|---|---|---|
Writers IT | 1'177 | 31'627 | 90% |
Writers EN | 29'581 | 302'180 | 79% |
Actors IT | 1'704 | 313'407 | 94% |
Actors EN | 6'621 | 110'797 | 77% |
###Challenges
The main challenge lies in the extreme variability of lists: unfortunately there isn't a real standard, and multiple formats are used with different meanings depending on the user who wrote the page. Also, the strong dependence on the topic, as well as the use of natural language, makes it impossible to find a general rule to extract semantic information without knowing in advance the kind of list and the resource type (at least without using advanced AI techniques, which are beyond the purpose of this project). Apart from the heterogeneity, there are unfortunately several Wikipedia pages with bad or wrong formatting, which is obviously reflected in the impurity of the extracted data. I have manually tried to correct some of these pages, but clearly it is an insignificant percentage.
###Future developments
One of my main goals was to provide scalability so that the extractor could be expanded and reach a greater potential. I have (hopefully) achieved that by making the program pick from mapping_rules.py both the mapping to use for the current resource domain and, for a given mapping, the keywords to be matched in section titles in order to form a statement. Anyone can extend the mapping process with a new language by simply inserting a new key as a language prefix and a list of keywords in that language. A new domain can be added in a similar way, but it also requires a new custom mapping function inside mapper.py (see the sketch below for a possible shape of these rules).
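A hypothetical illustration of how such rules could be organised (the actual structure of mapping_rules.py may differ): the first dictionary selects a mapping by domain, the others give the language-specific section keywords.

```python
# Illustrative only: a possible shape for the domain -> mapping -> keywords rules.
# Names and keyword lists are assumptions, not the repository's actual contents.
MAPPING = {
    'Writer': 'BIBLIOGRAPHY',
    'Actor': 'FILMOGRAPHY',
}

BIBLIOGRAPHY = {
    'en': ['bibliography', 'works', 'novels', 'short stories'],
    'it': ['bibliografia', 'opere', 'romanzi', 'racconti'],
}

FILMOGRAPHY = {
    'en': ['filmography', 'films', 'television'],
    'it': ['filmografia', 'film', 'televisione'],
}
```

With a layout like this, supporting a new language only means adding a key such as 'de' with its keyword list, while a new domain adds an entry to the domain dictionary plus the corresponding mapping function in mapper.py.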
Please refer to the README and to the docstrings for a more detailed explanation of how mappings work, as well as for module descriptions and instructions on how to run the extractor.
###Last Updates (descending chronological order)
[20-22 August] Code cleaning and documentation refinement :)
[19th August] Evaluation metrics on a sample of English pages about Actors can be found here; the overall accuracy is 77%.
[17th August] New commit with a dataset about Italian Actor pages. The related quality evaluation on a sample can be found here; the overall accuracy is 94%.
[16th August] General review of code and documentation. Moreover, the Actor mapping has been greatly improved by adding properties describing how the person participated in the film (as an actor, director, editor...) and the kind of film (movie, cartoon, TV show...).
[14th August] I'm extracting lists from actors' filmography: every element under a filmography section is labeled as dbo:Movie, but unfortunately after some tests I realized that this information is not sufficient to say that the person represented in the page acts in that movie, because in many cases these actors are also directors, and their filmography actually contains the movies they directed. This means that, to avoid uncertainty, I look for other keywords in section and sub-section names to state which role was played. In this way I extract less information, but it is more correct. The main problem observed for this project is the extreme variety of lists on Wikipedia: finding a generalizing extraction algorithm sometimes isn't even possible because they are too different from each other and there isn't a standard for users to follow. During this period I have sometimes modified and updated some pages which were clearly wrong or badly formatted, but obviously that isn't nearly enough.
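A hedged sketch of this role detection; the keyword lists and the chosen DBpedia properties are assumptions for illustration:

```python
# Illustrative role detection from section/sub-section titles (English + Italian).
ROLE_KEYWORDS = {
    'dbo:starring': ['actor', 'acting', 'attore'],
    'dbo:director': ['director', 'directed', 'regista'],
    'dbo:editing':  ['editor', 'montaggio'],
}

def detect_role(section_title):
    title = section_title.lower()
    for prop, keywords in ROLE_KEYWORDS.items():
        if any(k in title for k in keywords):
            return prop
    return None  # role unknown: better to skip than to assert an uncertain triple

print(detect_role("Filmography - As director"))  # -> 'dbo:director'
```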
[11th August] Scalability is essential and I am extending the solutions for resources from the Actor class: the extractor may be expanded by anybody in the future, both for new domains and languages, (hopefully) with very little effort. If you ask for a single resource, the program asks the endpoint what kind of resource it is (rdf:type property) and looks for a suitable mapping for each class (e.g. a page about a person who is both a Writer and an Actor and contains both bibliography and filmography lists). On the other hand, you can ask for a whole class and it will collect every resource of that type and apply the related mapping.
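For the class mode, one possible way to collect every resource of a given ontology class is to page through the endpoint with LIMIT/OFFSET; this is a sketch under that assumption, not necessarily the query the extractor uses:

```python
# Sketch: collect all DBpedia resources of a given ontology class, e.g. "Actor".
from SPARQLWrapper import SPARQLWrapper, JSON

def resources_of_class(dbo_class, page_size=1000):
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    resources, offset = [], 0
    while True:
        sparql.setQuery("""
            SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/%s> }
            LIMIT %d OFFSET %d""" % (dbo_class, page_size, offset))
        rows = sparql.query().convert()["results"]["bindings"]
        if not rows:
            break
        resources.extend(r["s"]["value"] for r in rows)
        offset += page_size
    return resources

print(len(resources_of_class("Actor")))
```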
[7th August] Added a new feature to extract ISBN numbers from bibliography elements (if present). The information extracted from lists in pages about Writers now includes: work titles, their author, publication year, ISBN number and literary genre. I'm currently doing some research to find another suitable domain for list extraction.
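A possible regex for the ISBN extraction, shown as an assumption rather than the project's exact pattern:

```python
# Assumption: a simple pattern catching ISBN-10/ISBN-13 codes as they typically
# appear in bibliography items; the extractor's real pattern may be stricter.
import re

ISBN_RE = re.compile(r'ISBN[\s:]*((?:97[89][- ]?)?\d{1,5}(?:[- ]?\d+){2}[- ]?[\dX])',
                     re.IGNORECASE)

def extract_isbn(list_item):
    match = ISBN_RE.search(list_item)
    return match.group(1) if match else None

print(extract_isbn("Neuromancer (1984), ISBN 0-441-56956-0"))  # -> '0-441-56956-0'
```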
[2nd August] Quality evaluations about extraction are available here for English pages and here for Italian pages about Writers. There is an overall accuracy of 90% for Italian and 79% for English.
[1st August] The new commit includes an improved mapping and the automatic search for Wikipedia page redirects (the DBpedia resources don't always match the page titles, so this prevents losing information). The current extraction process returns 14'833 list elements from all the 1'177 Italian pages about writers and 257'095 list elements from all the 29'581 English pages.
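One way to resolve redirects is through the MediaWiki API; the sketch below assumes that approach (the extractor might resolve them differently):

```python
# Sketch: resolve a Wikipedia redirect so the page title matches the DBpedia resource.
import requests

def resolve_redirect(title, language='en'):
    api = "https://%s.wikipedia.org/w/api.php" % language
    params = {'action': 'query', 'titles': title, 'redirects': 1, 'format': 'json'}
    pages = requests.get(api, params=params).json()['query']['pages']
    return next(iter(pages.values()))['title']

print(resolve_redirect("UK"))  # the redirect resolves to 'United Kingdom'
```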
[25th July] Working on an update: the mapping process is more precise and includes the exploitation of italic formatting, which is very often used to identify work titles. I will soon publish new datasets and a quality evaluation.
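Assuming the list elements are available as raw wikitext, the italic markers (pairs of single quotes) can be used to pull out the title; a minimal sketch:

```python
# Assumption: in wikitext, work titles are usually wrapped in '' ... '' (italics).
import re

ITALIC_RE = re.compile(r"''([^']+)''")

def extract_title(wikitext_item):
    match = ITALIC_RE.search(wikitext_item)
    return match.group(1) if match else None

print(extract_title("''Neuromancer'' (1984), Ace Books"))  # -> 'Neuromancer'
```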
[23rd July] I am currently analyzing a sample of mapped resources to define quality metrics, as my mentors suggested (I can't manually check all resources because there are almost 30'000 of them). The process is slow and tedious since every page must be checked and compared with the dataset, but it's helping me to get a better understanding of the mapping problems and to focus on future improvements.
[12th July] Mapping has become more precise, resulting in cleaner datasets (which can be found in the repository). I am extending the program to also reconcile the literary genre of the works found for the given author (currently it maps the author's works and their publication year). Moreover, I am heading towards an attempt at generalization, so that more languages and more topics can be mapped by adding simple new rules.
[6th July] Major refactoring of statistics.py within Table Extractor with papalinis.
[5th July] Refinements and dataset publication. Now it's possible to specify a single writer instead of all of them, and instructions on how to use the script are provided. Some problems, such as encoding issues, have (hopefully) been solved. I also had a meeting with my mentors in which we discussed expandability and quality metrics of the datasets.
[30th June] Focusing more on data creation as my mentor suggested. Thanks to JSONpedia's improved performance and some bug fixing, I can now obtain a larger quantity of data, but I must further improve its quality.
[22nd June] I'm analyzing the resulting RDF triples and working on the adjustment and refinement of the current solution. At the moment I parse pages about writers and extract info about their works, associating them with the author and the release year (where available).
[18th June] Obtaining first results from writer pages in Italian and English. I used the Wikidata API to reconcile URIs and a SPARQL call to find the equivalent resource on DBpedia, as suggested by my mentor. I also used regexes to extract relevant info from the unstructured text contained in list elements. There is still much to do to refine this solution, but I think I'm on the right track :)
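A hedged sketch of that reconciliation: search the title with the Wikidata API, then ask the DBpedia endpoint for the resource linked to the returned entity via owl:sameAs. The query shape and parameters are assumptions.

```python
# Sketch: reconcile a work title to a DBpedia URI via Wikidata (approach is assumed).
import requests
from SPARQLWrapper import SPARQLWrapper, JSON

def wikidata_id(title, language='en'):
    params = {'action': 'wbsearchentities', 'search': title,
              'language': language, 'format': 'json'}
    results = requests.get("https://www.wikidata.org/w/api.php", params=params).json()
    return results['search'][0]['id'] if results['search'] else None  # first QID, or None

def dbpedia_resource(wikidata_qid):
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        SELECT ?s WHERE { ?s owl:sameAs <http://www.wikidata.org/entity/%s> } LIMIT 1
    """ % wikidata_qid)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0]["s"]["value"] if rows else None

print(dbpedia_resource(wikidata_id("Neuromancer")))
```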
[14th June] Working on the Mapper module. I'm using the DBpedia Lookup service to retrieve the URI represented by a given string. Unfortunately it only works for the English language, and after some tests I realized that it can't be used for section titles since the accuracy is too low. I'm now considering a new approach, different from my original idea.
[9th June] Parser module completed. Successfully tested on the resources List of Works of William Gibson, the English William Gibson page and the Italian William Gibson page. Now proceeding with further testing and figuring out how to implement the next modules.
[6th June] Currently working on cleaning and parsing as lists the data obtained from [JSONpedia](http://jsonpedia.org/frontend/index.html). Starting from the page https://en.wikipedia.org/wiki/List_of_works_of_William_Gibson.
[30th May]
- There are 29581 English Wiki pages about Writers, including 90717 lists.
- In Italian there are 1177 pages about Writers, including 3232 lists.
- There are 181790 English pages about Directors, with 52114 lists.
- There are 6326 Italian pages about Directors, with 24957 lists.
Writers seem an interesting domain, and I'm going to start from a single writer and his lists of works (currently working on William Gibson). Other good candidates are directors and actors (with their lists of featured movies and awards).
[23rd May] START OF CODING. Figuring out how to include and use JSONpedia in my project, as the online web service is often unavailable due to crawlers. I will use the online web service for now (http://jsonpedia.org/frontend/index.html).
[22nd May] END OF BONDING PERIOD. I had constant discussion and feedback with my mentors and made some preliminary analysis via the Python script (statistics.py) available in the Table Extractor; I am ready to start coding. I have also installed the DBpedia extraction framework and performed an extraction on Italian wiki pages to gain a better understanding of the framework.
[19th May] Discussion with all co-mentors about suitable wiki domains for lists. We decided to further examine filmographies, bibliographies and related contexts such as lists of nominations and awards.
[18th May] Contributing with papalinis to statistics.py in the Table Extractor to do some domain analysis useful for both the Table and the List Extractor. It will be improved shortly.
[11th May] I am currently analyzing the occurrences of lists in various Wikipedia pages and querying SPARQL endpoints to choose a suitable domain to start with.
###Mentors
- Marco Fossati
- Claudia Diamantini
- Domenico Potena
- Emanuele Storti