drop wdosm_exp1_index for #3 and rev readmes and comments
ppKrauss committed Aug 22, 2018
1 parent bf4b501 commit c5550b9
Showing 4 changed files with 23 additions and 39 deletions.
31 changes: 19 additions & 12 deletions README.md
@@ -1,33 +1,40 @@
# simple-osmWd2csv

Simplest and stupid algorithm to convert big big OSM files into a **simple CSV file for [Wikidata-tag](https://wiki.openstreetmap.org/wiki/Key:wikidata) analysis**.
[Simplest and stupid algorithm (v0.0.2)](https://github.com/OSMBrasil/simple-osmWd2csv/tree/v0.0.2) and some evolution to convert big OSM files into a **simple CSV file for [Wikidata-tag](https://wiki.openstreetmap.org/wiki/Key:wikidata) analysis**.

This project is suitable for *OSM beginners*, for Unix *agile* pipeline lovers... And for those who do not trust anyone: for *fact checking*.
This project has two distinct parts:

1. **XML parser**: converts a big [OSM XML file](https://wiki.openstreetmap.org/wiki/OSM_XML) into a lean tabular format, the intermediary CSV "raw" format. There are other tools (e.g. OPL); this one is suitable for *OSM beginners*, for Unix *agile* pipeline lovers... And for those who do not trust anyone: for *fact checking*.<br/> It is the pipeline of [`step1-osmWd2csv_pre.sh`](src/step1-osmWd2csv_pre.sh) and [`step2-osmWd2csv.php`](src/step2-osmWd2csv.php).

2. **Final parser and SQL database**. It is not so simple: it mixes some libraries with medium-complexity SQL processing. It produces all final products, including the export to the proposed data-interchange format, `.wdDump.csv`.

The purpose of the "OSM Wikidata-elements CSV file" is to feed PostgreSQL (or SQLite) with complete and reliable data. See e.g. the [Semantic-bridge OSM-Wikidata](https://github.com/OSMBrasil/semantic-bridge) project.

## Basic results and aims

This project also defines two simple data-interchange formats for testing and benchmarking OSM-Wikidata tools.

**`.wdDump.csv` format**: the best way to dump, **analyse or interchange OSM-Wikidata "bigdata"**. It is a [standard CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file with columns `<osm_type,osm_id,wd_ids,wd_member_ids>`. <br/>The first field, `osm_type`, is the [OSM element type](https://wiki.openstreetmap.org/wiki/Elements), abbreviated as a letter ("n" for node, "w" for way and "r" for relation); the second is its real ID as of the capture date; the third is its Wikidata ID (a *qid* without the "Q" prefix), which may be absent or may hold more than one ID; and the fourth, `wd_member_ids`, is a set of space-separated Wikidata IDs of member elements, which in some cases can be assumed to apply to the parent element itself. The `.wdDump.csv` is the final format of the parsing process also described in this repo. <br/>It takes **~0.01% of the size of the XML (`.osm`) format** of the Wikidata-filtered file, and its zipped file ~0.4% of the `.osm.pbf` format (see the [summary of file sizes](example.md)). The CPU time needed to process or analyse it is also much lower, and for data analysts it is a **standard source of truth** in any SQL tool. Example:
**`.wdDump.csv` format**: the best way to dump, **analyse or interchange OSM-Wikidata "bigdata"**. It is a [standard CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file with columns <br/> &nbsp;&nbsp; `<osm_type,osm_id,wd_ids,wd_member_ids>`.

The first field, `osm_type`, is the [OSM element type](https://wiki.openstreetmap.org/wiki/Elements), abbreviated as a letter ("n" for node, "w" for way and "r" for relation); the second is its real ID as of the capture date; the third is its Wikidata ID (a *qid* without the "Q" prefix), which may be absent or may hold more than one ID; and the fourth, `wd_member_ids`, is a set of space-separated Wikidata IDs of member elements, which in some cases can be assumed to apply to the parent element itself. The `.wdDump.csv` is the final format of the parsing process also described in this repo. <br/>It takes **~0.01% of the size of the XML (`.osm`) format** of the Wikidata-filtered file, and its zipped file ~0.4% of the `.osm.pbf` format (see the [summary of file sizes](example.md)). The CPU time needed to process or analyse it is also much lower, and for data analysts it is a **standard source of truth** in any SQL tool. Example:

osm_type |osm_id |wd_ids|wd_member_ids
----------|-------|------|-----
n |32011242 |Q49660|
w |28712148 |Q1792561|
w |610491098 |Q18482699|Q18482699:2
r |1988261 | | Q315548:49
r |51701 | Q39 | Q11925:1 Q12746:1
r |3366718 | Q386331 | Q386331:15
osm_type |osm_id |wd_id|geohash|wd_member_ids
----------|-------|------|-----|-------
n |[32011242](https://www.openstreetmap.org/node/32011242) |[Q49660](http://wikidata.org/entity/Q49660)| [`u0qu3jyt`](http://geohash.org/u0qu3jyt) |
w |[28712148](https://www.openstreetmap.org/way/28712148) |[Q1792561](http://wikidata.org/entity/Q1792561)| [`u0qu0tmz`](http://geohash.org/u0qu0tmz)|
w |[610491098](https://www.openstreetmap.org/way/610491098) |[Q18482699](http://wikidata.org/entity/Q18482699)|[`6vjwdr`](http://geohash.org/6vjwdr) | [Q18482699](http://wikidata.org/entity/Q18482699):2
r |[1988261](https://www.openstreetmap.org/relation/1988261) | |[`u1j1`](http://geohash.org/u1j1) |[Q315548](http://wikidata.org/entity/Q315548):49
r |[51701](https://www.openstreetmap.org/relation/51701) | Q39 |[`u0m`](http://geohash.org/u0m) |Q11925:1 Q12746:1
r |[3366718](https://www.openstreetmap.org/relation/3366718) | Q386331 |[`6ur`](http://geohash.org/6ur)| Q386331:15
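
As a rough illustration of the `.wdDump.csv` layout described above, the following Python sketch parses a few rows and turns the member field into the JSONb-style mapping (e.g. `{"315548": 49}`). It assumes the four-column layout `osm_type,osm_id,wd_ids,wd_member_ids`, that multiple `wd_ids` are space-separated, and that member entries look like `Q11925:1` as in the example table. The helper names `parse_member_ids` and `parse_row` are hypothetical and not part of this repository.

```python
import csv
import io
import json


def parse_member_ids(field):
    """Turn 'Q11925:1 Q12746:1' into {'11925': 1, '12746': 1}."""
    members = {}
    for token in field.split():
        qid, _, count = token.partition(":")
        # Assumption: a token without ':count' is counted once.
        members[qid.lstrip("Q")] = int(count) if count else 1
    return members


def parse_row(row):
    """Parse one CSV row of the assumed four-column layout."""
    osm_type, osm_id, wd_ids, wd_member_ids = (cell.strip() for cell in row)
    return {
        "osm_type": osm_type,  # 'n', 'w' or 'r'
        "osm_id": int(osm_id),
        # Assumption: several IDs are space-separated; the "Q" prefix is optional.
        "wd_ids": [int(q.lstrip("Q")) for q in wd_ids.split()],
        "member_wd_ids": parse_member_ids(wd_member_ids),
    }


# Two rows taken from the example table, written as plain CSV.
sample = "r,51701,Q39,Q11925:1 Q12746:1\nr,1988261,,Q315548:49\n"
rows = [parse_row(r) for r in csv.reader(io.StringIO(sample))]

print(json.dumps(rows[1]["member_wd_ids"]))  # {"315548": 49}
```

Keeping the member counts in a mapping keyed by the numeric part of the qid mirrors the `{"315548":49}` shape; an array-of-pairs layout would work equally well for this sketch.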

The same table in SQL can be converted to JSON or JSONb with the following structure:

```sql
TABLE wdOsm.raw (
osm_type char NOT NULL, -- reduced to n/w/r
osm_id bigint NOT NULL,
wd_ids bigint[], -- "Q" removed
wd_id bigint, -- "Q" removed
geohash bigint, -- must convert to base32
member_wd_ids JSONb, -- e.g. {"315548":49}
-- option bigint[[key,value]] = array[array[315548,49],array[392600,2]]
-- and bigint2d_find(a,needle)
```
5 changes: 4 additions & 1 deletion src/README.md
@@ -1,2 +1,5 @@
Simple tools and algorithms.
Tools and algorithms to parse OSM into a simple Wikidata-OSM database.

**XML parser**: converts a big [OSM XML file](https://wiki.openstreetmap.org/wiki/OSM_XML) into a lean tabular format, the intermediary CSV "raw" format. The pipeline of [`step1-osmWd2csv_pre.sh`](src/step1-osmWd2csv_pre.sh) and [`step2-osmWd2csv.php`](src/step2-osmWd2csv.php).

**Final parser and SQL database**. It is not so simple: it involves medium-complexity SQL processing. See all `step*.sql` files.
4 changes: 0 additions & 4 deletions src/step0-2-osmWd_strut.sql
@@ -48,10 +48,6 @@ CREATE TABLE wdosm.main (
count_parseref_ids int, -- the total number of refs in the parsed member-set
UNIQUE(sid,osm_type,osm_id)
);
CREATE UNIQUE INDEX wdosm_exp1_index ON wdosm.main (osm_type,osm_id);
-- now any other can run by DELETE / INSERT instead TO CREATE.
-- so , run wdosm.alter_tmp_raw_csv('YOURS',true)


COMMENT ON TABLE wdosm.main IS $$Main table for Wikidata-tag OSM database dump.
The uniqueness of osm_id is complemented by osm_type and, as each country can repeat borders, sid.
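
The `UNIQUE(sid,osm_type,osm_id)` constraint kept in `wdosm.main` and the dropped `wdosm_exp1_index` on `(osm_type,osm_id)` enforce different things. The commit does not state its reasoning, so the following is only a sketch of what each one allows. It uses Python's `sqlite3` as a stand-in for PostgreSQL, with a simplified table whose column types and the function name `accepts_same_element_in_two_sids` are assumptions for illustration.

```python
import sqlite3


def accepts_same_element_in_two_sids(with_extra_index):
    """Return True if the same (osm_type, osm_id) can be stored under two sids."""
    con = sqlite3.connect(":memory:")
    # Simplified stand-in for wdosm.main; real column types are not shown in the diff.
    con.execute(
        "CREATE TABLE main ("
        " sid INTEGER NOT NULL,"
        " osm_type TEXT NOT NULL,"
        " osm_id INTEGER NOT NULL,"
        " UNIQUE (sid, osm_type, osm_id))"
    )
    if with_extra_index:
        # Equivalent of the dropped index: unique on (osm_type, osm_id) alone.
        con.execute("CREATE UNIQUE INDEX exp1_index ON main (osm_type, osm_id)")
    con.execute("INSERT INTO main VALUES (1, 'r', 51701)")
    try:
        con.execute("INSERT INTO main VALUES (2, 'r', 51701)")
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        con.close()


print(accepts_same_element_in_two_sids(with_extra_index=True))   # False
print(accepts_same_element_in_two_sids(with_extra_index=False))  # True
```

With only the three-column constraint, an element shared by two sources (for example a border that appears in two country extracts) can be stored once per `sid`, which matches the table comment.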
22 changes: 0 additions & 22 deletions src/step2-osmWd2csv.php
@@ -20,8 +20,6 @@
$nm = substr($r->localName,0,1); // results in n=node, w=way, r=relation
$lat = $r->getAttribute('lat'); // on node element, coordinates as centroid
$lastCentroid = $lat? ('c'.GeoHash::encode($r->getAttribute('lon'),$lat).' '): '';
// convert to 64 bits later in the database, and truncate according map_feature type.
//OLD numMix( numPad() , numPad($r->getAttribute('lon'),3) ).' '): '';
$lastLine = "$nm,". $r->getAttribute('id') .',';
} elseif ($r->nodeType == XMLReader::TEXT) {
$tx .= trim( preg_replace('/\s+/s',' ',$r->value) );
@@ -36,23 +34,3 @@ function printLine($line,$tx,$nm) {
if ($nm!=ND || $tx) // exclude empty nodes
print "\n".trim("$line$tx");
}

/* optional didactic interlace:
function numPad($x,$a_len=2) {
if (preg_match('/^-?(\d+)(?:\.(\d{1,4})\d*)?$/',$x,$m))
return str_pad( $m[1], $a_len, '0', STR_PAD_LEFT)
.str_pad( isset($m[2])?$m[2]:'0' , 4, '0');
else return '';
}
function numMix($x,$y) {
$x_len = strlen($x);
$y_len = strlen($y);
$x = str_split($x);
$y = str_split($y);
for( $i=0, $s=''; $i<$x_len; $i++ )
$s.=$x[$i].$y[$i];
if ($y_len>$x_len) $s.='0'.$y[$y_len-1];
return $s;
}
*/
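
The `GeoHash::encode` call kept in `step2-osmWd2csv.php` replaces the digit-interlacing helpers removed above. For readers unfamiliar with the idea, here is a minimal Python sketch of the standard geohash algorithm, which interleaves longitude and latitude bits and emits one base32 character per 5 bits. It is not the `GeoHash` class used by the script; the function name `geohash_encode`, its `(lat, lon)` argument order, and the default precision are assumptions for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=3))  # u4p
```

Because each extra character only refines the cell, truncating a geohash yields the enclosing cell. This is consistent with the shorter hashes shown for relations in the README table.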
