diff --git a/README.md b/README.md
index 31da2ab..ed0176a 100644
--- a/README.md
+++ b/README.md
@@ -1,8 +1,12 @@
# simple-osmWd2csv
-Simplest and stupid algoritm to convert big big OSM files into **simple CSV file for [Wikidata-tag](https://wiki.openstreetmap.org/wiki/Key:wikidata) analysis**.
+[Simplest and stupid algorithm (v0.0.2)](https://github.com/OSMBrasil/simple-osmWd2csv/tree/v0.0.2), plus some later evolution, to convert big OSM files into a **simple CSV file for [Wikidata-tag](https://wiki.openstreetmap.org/wiki/Key:wikidata) analysis**.
-This project is suitable for *OSM beginners*, for Unix *agile* pipeline lovers... And for those who do not trust anyone: to *fact checking*.
+This project has two distinct parts:
+
+1. **XML parser**: converts a big [OSM XML file](https://wiki.openstreetmap.org/wiki/OSM_XML) into a lean tabular format, the intermediary "raw" CSV format. There are other tools (e.g. OPL); this one is suitable for *OSM beginners*, for Unix *agile* pipeline lovers... and for those who do not trust anyone: for *fact checking*.
The pipeline of [`step1-osmWd2csv_pre.sh`](src/step1-osmWd2csv_pre.sh) and [`step2-osmWd2csv.php`](src/step2-osmWd2csv.php).
+
+2. **Final parser and SQL database**. It is not so simple: it mixes some libraries with medium-complexity SQL processing. It produces all final products, including the export to the proposed data interchange format, `.wdDump.csv`.
The target of the "OSM Wikidata-elements CSV file" is to feed PostgreSQL (or SQLite) with complete and reliable data. See eg. [Semantic-bridge OSM-Wikidata](https://github.com/OSMBrasil/semantic-bridge) project.
@@ -10,16 +14,18 @@ The target of the "OSM Wikidata-elements CSV file" is to feed PostgreSQL (or SQL
This project also define two simple data-interchange formats for tests and benchmark of OSM-Wikidata tools.
-**`.wdDump.csv` format**: is the best way to dump, **analyse or interchange OSM-Wikidata "bigdata"**. Is a [standard CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file with columns ``.
The first field, `osm_type`, is the [OSM element type](https://wiki.openstreetmap.org/wiki/Elements), abbreviated as a letter ("n" for node, "w" for way and "r" for relation); the second its real ID, in the date of the capture; the third its Wikidata ID (a *qid* without the "Q" prefix), sometimes no ID, sometimes more tham one ID; and the fourth, `wd_member_ids`, is a set of space-separed Wikidata IDs of member-elements, that eventually can be assumed as self (parent element). The `.wdDump.csv` is the final format of the parsing process described also in this repo.
Consumes **~0.01% of the XML (`.osm`) format**, of the wikidata-filtered file, and its zipped file ~0.4% of the `.osm.pbf` format — see the [summary of file sizes](example.md). In CPU time to process or analyse is also a big gain... And for data-analists is a **standard source of the truth** at any SQL tool. Example:
+**`.wdDump.csv` format**: the best way to dump, **analyse or interchange OSM-Wikidata "bigdata"**. It is a [standard CSV](https://en.wikipedia.org/wiki/Comma-separated_values) file with columns
``.
+
+The first field, `osm_type`, is the [OSM element type](https://wiki.openstreetmap.org/wiki/Elements), abbreviated as a letter ("n" for node, "w" for way and "r" for relation); the second its real ID, as of the capture date; the third its Wikidata ID (a *qid* without the "Q" prefix), sometimes no ID, sometimes more than one ID; and the fourth, `wd_member_ids`, is a set of space-separated Wikidata IDs of member-elements, which can possibly be assumed as self (parent element). The `.wdDump.csv` is the final format of the parsing process also described in this repo.
Consumes **~0.01% of the XML (`.osm`) format**, of the wikidata-filtered file, and its zipped file ~0.4% of the `.osm.pbf` format — see the [summary of file sizes](example.md). In CPU time to process or analyse is also a big gain... And for data-analists is a **standard source of the truth** at any SQL tool. Example:
-osm_type |osm_id |wd_ids|wd_member_ids
-----------|-------|------|-----
-n |32011242 |Q49660|
-w |28712148 |Q1792561|
-w |610491098 |Q18482699|Q18482699:2
-r |1988261 | | Q315548:49
-r |51701 | Q39 | Q11925:1 Q12746:1
-r |3366718 | Q386331 | Q386331:15
+osm_type |osm_id |wd_id|geohash|wd_member_ids
+----------|-------|------|-----|-------
+n |[32011242](https://www.openstreetmap.org/node/32011242) |[Q49660](http://wikidata.org/entity/Q49660)| [`u0qu3jyt`](http://geohash.org/u0qu3jyt) |
+w |[28712148](https://www.openstreetmap.org/way/28712148) |[Q1792561](http://wikidata.org/entity/Q1792561)| [`u0qu0tmz`](http://geohash.org/u0qu0tmz)|
+w |[610491098](https://www.openstreetmap.org/way/610491098) |[Q18482699](http://wikidata.org/entity/Q18482699)|[`6vjwdr`](http://geohash.org/6vjwdr) | [Q18482699](http://wikidata.org/entity/Q18482699):2
+r |[1988261](https://www.openstreetmap.org/relation/1988261) | |[`u1j1`](http://geohash.org/u1j1) |[Q315548](http://wikidata.org/entity/Q315548):49
+r |[51701](https://www.openstreetmap.org/relation/51701) | Q39 |[`u0m`](http://geohash.org/u0m) |Q11925:1 Q12746:1
+r |[3366718](https://www.openstreetmap.org/relation/3366718) | Q386331 |[`6ur`](http://geohash.org/6ur)| Q386331:15
The same table in SQL can be converted in JSON or JSONb with the following structure:
@@ -27,7 +33,8 @@ The same table in SQL can be converted in JSON or JSONb with the following struc
TABLE wdOsm.raw (
osm_type char NOT NULL, -- reduced to n/w/r
osm_id bigint NOT NULL,
- wd_ids bigint[], -- "Q" removed
+ wd_id bigint, -- "Q" removed
+ geohash bigint, -- must convert to base32
member_wd_ids JSONb, -- e.g. {"315548":49}
-- option bigint[[key,value]] = array[array[315548,49],array[392600,2]]
-- and bigint2d_find(a,needle)
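
Reviewer note: the README table writes member references as space-separated `Q<id>:<count>` tokens (e.g. `Q11925:1 Q12746:1`), while the SQL comment shows the JSONb form `{"315548":49}`. A minimal sketch of that mapping follows, assuming every token carries a `:<count>` suffix as in the examples. It is written in Python only for illustration; the function name `member_ids_to_json` is hypothetical and not part of this repo.

```python
import json


def member_ids_to_json(field: str) -> str:
    """Map a wd_member_ids CSV field to a JSON object string.

    Assumption: tokens look like 'Q315548:49' (qid, colon, count),
    as in the README example rows; anything else raises ValueError.
    """
    counts = {}
    for token in field.split():
        qid, _, count = token.partition(":")
        # Drop the "Q" prefix, as the SQL comment does for wd_id.
        key = qid[1:] if qid.startswith("Q") else qid
        counts[key] = int(count)  # ValueError if the count is missing
    return json.dumps(counts)


if __name__ == "__main__":
    print(member_ids_to_json("Q11925:1 Q12746:1"))
```

An empty field maps to an empty JSON object, which matches rows that have no member references.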
diff --git a/src/README.md b/src/README.md
index a06ffb0..ff04cc2 100644
--- a/src/README.md
+++ b/src/README.md
@@ -1,2 +1,5 @@
-Simple tools and algorithms.
+Tools and algorithms to parse OSM into a simple Wikidata-OSM database.
+**XML parser**: converts a big [OSM XML file](https://wiki.openstreetmap.org/wiki/OSM_XML) into a lean tabular format, the intermediary "raw" CSV format. It is the pipeline of [`step1-osmWd2csv_pre.sh`](step1-osmWd2csv_pre.sh) and [`step2-osmWd2csv.php`](step2-osmWd2csv.php).
+
+**Final parser and SQL database**. It is not so simple: it has medium-complexity SQL processing. See all `step*.sql` files.
diff --git a/src/step0-2-osmWd_strut.sql b/src/step0-2-osmWd_strut.sql
index 6f34fa6..9b95981 100644
--- a/src/step0-2-osmWd_strut.sql
+++ b/src/step0-2-osmWd_strut.sql
@@ -48,10 +48,6 @@ CREATE TABLE wdosm.main (
count_parseref_ids int, -- the total number of refs in the parsed member-set
UNIQUE(sid,osm_type,osm_id)
);
-CREATE UNIQUE INDEX wdosm_exp1_index ON wdosm.main (osm_type,osm_id);
--- now any other can run by DELETE / INSERT instead TO CREATE.
--- so , run wdosm.alter_tmp_raw_csv('YOURS',true)
-
COMMENT ON TABLE wdosm.main IS $$Main table for Wikidata-tag OSM database dump.
The uniqueness if osm_id is complemented by osm_type and, as each country can repeat borders, sid.
diff --git a/src/step2-osmWd2csv.php b/src/step2-osmWd2csv.php
index ff6c439..a148c8b 100644
--- a/src/step2-osmWd2csv.php
+++ b/src/step2-osmWd2csv.php
@@ -20,8 +20,6 @@
$nm = substr($r->localName,0,1); // results in n=node, w=way, r=relation
$lat = $r->getAttribute('lat'); // on node element, coordinates as centroid
$lastCentroid = $lat? ('c'.GeoHash::encode($r->getAttribute('lon'),$lat).' '): '';
- // convert to 64 bits later in the database, and truncate according map_feature type.
- //OLD numMix( numPad() , numPad($r->getAttribute('lon'),3) ).' '): '';
$lastLine = "$nm,". $r->getAttribute('id') .',';
} elseif ($r->nodeType == XMLReader::TEXT) {
$tx .= trim( preg_replace('/\s+/s',' ',$r->value) );
@@ -36,23 +34,3 @@ function printLine($line,$tx,$nm) {
if ($nm!=ND || $tx) // exclude empty nodes
print "\n".trim("$line$tx");
}
-
-/* optional didactic interlace:
-function numPad($x,$a_len=2) {
- if (preg_match('/^-?(\d+)(?:\.(\d{1,4})\d*)?$/',$x,$m))
- return str_pad( $m[1], $a_len, '0', STR_PAD_LEFT)
- .str_pad( isset($m[2])?$m[2]:'0' , 4, '0');
- else return '';
-}
-
-function numMix($x,$y) {
- $x_len = strlen($x);
- $y_len = strlen($y);
- $x = str_split($x);
- $y = str_split($y);
- for( $i=0, $s=''; $i<$x_len; $i++ )
- $s.=$x[$i].$y[$i];
- if ($y_len>$x_len) $s.='0'.$y[$y_len-1];
- return $s;
-}
-*/