Static Linked Data Anonymizer

This is the implementation of a research project whose goal is to find ways to anonymize Linked Open Data, notably RDF datasets. The framework anonymizes RDF datasets before their publication in the LOD cloud and ensures safety, i.e. it prevents leakages by ensuring that no new results for a given set of privacy queries are created by the union with any other external dataset. This lets us use static algorithms, which do not depend on an input graph and are efficient in terms of runtime.
This framework is part of a submission to the ISWC 2019 conference.

This program currently works by using a previously generated query workload, picking a fixed number of random queries from this workload and assigning them to a privacy policy (a set of queries). The code then computes a set of operations which must be performed on a graph to make it safe given the privacy constraints provided as input. A default example with a simple, fixed privacy policy is also provided.
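For illustration only, the policy-sampling step can be pictured in a few lines of Python; the workload contents below are hypothetical stand-ins, not taken from the actual code:

import random

# Illustrative stand-in for the parsed query workload (the real queries are
# SPARQL queries generated by gMark).
workload = ["SELECT ?x WHERE {{ ?x <p{0}> ?y }}".format(i) for i in range(100)]

# A privacy policy is a set of queries; pick a fixed number of them at random.
policy_size = 10
policy = random.sample(workload, policy_size)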

Project architecture

  • conf directory: configuration and input files (query workloads, test graphs, gMark configuration files).
  • *.py files: framework implementation in Python.
    • main.py is the main program to run the software.
  • runtime_test.sh: runs 20 iterations of the program with policies of cardinality 10 and measures their runtime.

Setup

To reproduce the various experiments and examples from the ISWC submission, you can execute the code yourself on our example graph schema.

Prerequisites

This project uses Python and should work with any version of Python starting from Python 2.7, including Python 3. You must install the following Python libraries to be able to run this program:

The query workload is created using gMark, which uses an XML configuration file to generate graphs and queries. You can follow the instructions on the gMark project to generate your own workload, or use the ones provided in the /conf/workloads directory. By default, /conf/workloads/starchain-workload.xml is used.

Example graph configuration

For reference, the configuration file describing the graph schema used for query workload generation is provided in the file /conf/test.xml. You can reuse this file, as explained in the gMark documentation, to generate your own graph and workload. It describes a graph schema modeling a transportation network and its users (persons, travels, lines, subscriptions), using 12 predicates linking 13 data types, in a final graph of 20,000 nodes.
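To give a rough idea of the kind of triples such a schema yields, here is a hedged rdflib sketch; the namespace and predicate names are hypothetical, not taken from /conf/test.xml:

from rdflib import Graph, Namespace

# Hypothetical vocabulary; the real schema defines 12 predicates over
# 13 data types (persons, travels, lines, subscriptions, ...).
EX = Namespace("http://example.org/transport/")

g = Graph()
g.add((EX.person1, EX.hasSubscription, EX.subscription1))
g.add((EX.person1, EX.madeTravel, EX.travel1))
g.add((EX.travel1, EX.usesLine, EX.line4))

print(g.serialize(format="turtle"))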

Usage

Standard execution

To run the program and anonymize a graph, just run it as follows:

python main.py

The graph anonymization feature is still a work in progress. We advise you to apply the operations we output yourself, on your usual triple store engine. Right now, the code uses a Turtle-formatted RDF graph named graph.ttl, located in the /conf/graphs directory.
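If you apply the operations yourself, a minimal rdflib sketch could look as follows, assuming an operation is expressed as a SPARQL update (the exact format produced by the program may differ, and the predicate below is hypothetical):

from rdflib import Graph

g = Graph()
g.parse("conf/graphs/graph.ttl", format="turtle")

# Hypothetical anonymization operation: delete a sensitive triple pattern.
g.update("DELETE WHERE { ?p <http://example.org/transport/madeTravel> ?t }")

g.serialize(destination="conf/graphs/graph_anonymized.ttl", format="turtle")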

You can also use the demo mode, which runs on a shortened workload, with the 2 privacy queries and 2 utility queries used in the article's examples:

python main.py -d

The demo mode also accepts policies written as text files. Policies have to be stored in the /conf/workload/policies folder, in a subfolder named after the graph (see CONFIG.ini), with files named p*.rq, where * ranges from 1 to the number of privacy queries.
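As an illustration of this layout, the following sketch loads such a policy; the loading code and the graph name are hypothetical (the real name is set in CONFIG.ini):

import os

graph_name = "graph"  # hypothetical; the actual name comes from CONFIG.ini
policy_dir = os.path.join("conf", "workload", "policies", graph_name)

# Read p1.rq, p2.rq, ... until no further policy file is found.
policy = []
i = 1
while os.path.exists(os.path.join(policy_dir, "p{0}.rq".format(i))):
    with open(os.path.join(policy_dir, "p{0}.rq".format(i))) as f:
        policy.append(f.read())
    i += 1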

To run this "textual mode", for example with a policy featuring two queries, run the following:

python main.py 2 -txt

Two example policy files are provided, using an example similar to the standard demo mode's.

The standard execution computes a set of operations, each of which must be performed on a graph to make it safe. This list of operations is written to the file out/ops.txt. After validation, the operations are performed on the graph, creating several output files (a sketch of this pipeline follows the list):

  • A copy of the original graph, named [original graph file name]_orig.ttl
  • One output file per operation, named [original graph file name]_anonymized_stepX.ttl, where X is the number of the applied operation.
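Here is a minimal sketch of that output convention, assuming out/ops.txt holds one SPARQL update per line (the real file format may differ):

import shutil
from rdflib import Graph

base = "conf/graphs/graph"
shutil.copy(base + ".ttl", base + "_orig.ttl")  # keep a copy of the original

g = Graph()
g.parse(base + ".ttl", format="turtle")

with open("out/ops.txt") as f:
    operations = [line.strip() for line in f if line.strip()]

# One output file per applied operation, as described above.
for step, op in enumerate(operations, start=1):
    g.update(op)
    g.serialize(destination="{0}_anonymized_step{1}.ttl".format(base, step),
                format="turtle")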

/!\ WARNING: running the program in standard mode will NOT erase previous outputs. RDF store files can get pretty big, so be careful!

Running the tests

To reproduce the runtime experiment presented in the ISWC submission, you can run the following script:

./runtime_test.sh

This will create a privacy policy of ten queries, randomly picked from the workload, and compute the set of operations for this policy. This experiment is looped 100 times, and we run the whole test another 20 times to ensure fresh random seeding.
