Commit

Add the Sainte-Agathe-des-Monts bylaws (#30)
* fix: remove unused scripts

* fix: fix errors in the Sainte-Agathe bylaws

* fix: do not overwrite an existing CSV

* chore: retrain

* feat: download Sainte-Agathe too

* fix: lint
dhdaines authored Jul 25, 2024
1 parent cbb0c0e commit 5ca849c
Showing 25 changed files with 6,537 additions and 1,843 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/analyse.yml
@@ -52,12 +52,15 @@ jobs:
               fi
             done
           done
+          alexi -v download -u https://vsadm.ca/citoyens/reglementation/reglementation-durbanisme/index.html -o download/vsadm --all-pdf-links
       - name: Extract
         run: |
           alexi -v extract -m download/index.json download/*.pdf
+          alexi -v extract -m download/vsadm/index.json -o export/vsadm download/vsadm/*.pdf
       - name: Index
         run: |
           alexi -v index export
+          alexi -v index export/vsadm
       - name: Setup Pages
         uses: actions/configure-pages@v5
       - name: Upload artifact
5 changes: 4 additions & 1 deletion TODO.md
@@ -3,7 +3,10 @@ DATA
 
 - Correct titles in zonage glossary
 - Correct extraction (see below, use RNN) of titles numbers etc
-- Annotate multiple TOCs in Sainte-Agathe urbanisme
+- Annotate multiple TOCs in Sainte-Agathe urbanisme DONE
+- Add Sainte-Agathe to download DONE
+- Add Sainte-Agathe to export (under /vsadm) DONE
+- Do the same thing for Saint-Sauveur
 
 DERP LERNING
 ------------
6 changes: 5 additions & 1 deletion alexi/annotate.py
@@ -140,6 +140,10 @@ def main(args: argparse.Namespace) -> None:
         with open(args.csv, "rt", encoding="utf-8-sig") as infh:
             iob = list(csv.DictReader(infh))
     else:
+        args.csv = args.out.with_suffix(".csv")
+        if args.csv.exists():
+            LOGGER.error("Fichier déjà existant: %s", args.csv)
+            return
         if args.segment_model is not None:
             crf = Segmenteur(args.segment_model)
             crf_n = crf
@@ -155,7 +159,7 @@ def main(args: argparse.Namespace) -> None:
         else:
             segs = crf(feats)
         iob = list(spread_i(crf_s(segs)))
-        with open(args.out.with_suffix(".csv"), "wt") as outfh:
+        with open(args.csv, "wt") as outfh:
             write_csv(iob, outfh)
     annotate_pdf(args.doc, pages, iob, args.out.with_suffix(".pdf"))
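
Review note: the change above refuses to write the annotation CSV when the target already exists ("Fichier déjà existant" means "File already exists"). The following is a minimal sketch of the same idea, not the code in this commit. The helper name `write_rows_once` and the column names in the usage note are hypothetical. It relies on the standard `"x"` open mode, which fails instead of truncating, so the existence check and the file creation happen in one step.

```python
import csv
from pathlib import Path


def write_rows_once(path: Path, rows: list[dict[str, str]]) -> bool:
    """Write rows to path as CSV unless the file already exists.

    Hypothetical helper for illustration only.
    Returns True if the file was written, False if nothing was written.
    """
    if not rows:
        # Nothing to write, so do not create an empty file.
        return False
    try:
        # Mode "x" raises FileExistsError rather than overwriting.
        with open(path, "x", newline="", encoding="utf-8") as outfh:
            writer = csv.DictWriter(outfh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    except FileExistsError:
        return False
    return True
```

Calling `write_rows_once(Path("doc.csv"), [{"text": "Article", "segment": "B-Titre"}])` twice in a row would write the file the first time and leave it untouched the second time.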

32 changes: 22 additions & 10 deletions alexi/download.py
@@ -39,6 +39,12 @@ def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
         action="append",
         default=[],
     )
+    parser.add_argument(
+        "--all-pdf-links",
+        action="store_true",
+        help="Télécharger les liens vers des PDF dans le document "
+        "sans égard à sa structure",
+    )
     parser.add_argument(
         "section",
         help="Expression régulière pour sélectionner la section des documents",
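
Review note: the new flag's help text reads "Download the links to PDFs in the document regardless of its structure". The sketch below shows how such a `store_true` flag behaves next to an `append` option. It is an illustration under assumptions: the parser name and the `--exclude` option name are hypothetical and are not taken from this hunk.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small stand-in parser (hypothetical, for illustration)."""
    parser = argparse.ArgumentParser(prog="download-sketch")
    # Assumed option name: collects one pattern per occurrence.
    parser.add_argument("--exclude", action="append", default=[])
    # store_true: False unless the flag is given; stored as args.all_pdf_links.
    parser.add_argument("--all-pdf-links", action="store_true")
    return parser
```

With no arguments, `args.all_pdf_links` is `False` and `args.exclude` is an empty list. Passing `--all-pdf-links --exclude Plan` sets the flag and appends one pattern.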
@@ -71,19 +77,25 @@ def main(args: argparse.Namespace) -> None:
     paths = []
     with open(args.outdir / Path(u.path).name) as infh:
         soup = BeautifulSoup(infh, "lxml")
-        for h2 in soup.find_all("h2", string=re.compile(args.section, re.I)):
-            ul = h2.find_next("ul")
-            for li in ul.find_all("li"):
-                path = li.a["href"]
-                excluded = False
-                for r in excludes:
-                    if r.search(path):
-                        excluded = True
-                        break
-                if not excluded:
+        if args.all_pdf_links:
+            for a in soup.find_all("a"):
+                path = a["href"]
+                if path.lower().endswith(".pdf"):
                     paths.append(path)
+        else:
+            for h2 in soup.find_all("h2", string=re.compile(args.section, re.I)):
+                ul = h2.find_next("ul")
+                for li in ul.find_all("li"):
+                    paths.append(li.a["href"])
     urls = {}
     for p in paths:
+        excluded = False
+        for r in excludes:
+            if r.search(p):
+                excluded = True
+                break
+        if excluded:
+            continue
         up = urllib.parse.urlparse(p)
         if up.netloc:
             url = p
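
Review note: this hunk collects every link ending in `.pdf` and moves the exclusion filter so that it applies to both code paths. The sketch below reproduces that flow using only the standard library: `html.parser` stands in for BeautifulSoup, and `urljoin` stands in for the `urlparse` handling that continues past the end of the hunk. The names `LinkCollector` and `pdf_links` are hypothetical. One difference is deliberate: anchors without an `href` attribute are skipped, whereas `a["href"]` on a BeautifulSoup tag raises `KeyError` when the attribute is missing.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has a non-empty one."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def pdf_links(html: str, base_url: str, excludes: list[re.Pattern]) -> list[str]:
    """Return absolute URLs of PDF links not matched by any exclude pattern."""
    collector = LinkCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        # Case-insensitive extension check, as in the hunk above.
        if not href.lower().endswith(".pdf"):
            continue
        # One filter for every collected link, whatever its origin.
        if any(r.search(href) for r in excludes):
            continue
        # Resolve relative links against the page they came from.
        urls.append(urljoin(base_url, href))
    return urls
```

Filtering after collection keeps the exclusion logic in one place, which is the same design choice the hunk makes by moving the `excludes` loop into `for p in paths`.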
Binary file modified alexi/models/crf.joblib.gz
Binary file not shown.
Binary file modified alexi/models/crf.vl.joblib.gz
Binary file not shown.
Binary file modified alexi/models/crfseq.joblib.gz
Binary file not shown.
852 changes: 852 additions & 0 deletions data/patches/u51_patch1.csv

Large diffs are not rendered by default.

807 changes: 807 additions & 0 deletions data/patches/u51_patch2.csv

Large diffs are not rendered by default.

336 changes: 336 additions & 0 deletions data/patches/u51_patch3.csv

Large diffs are not rendered by default.

674 changes: 674 additions & 0 deletions data/patches/u51_patch4.csv

Large diffs are not rendered by default.

1,746 changes: 1,746 additions & 0 deletions data/patches/u53_patch1.csv

Large diffs are not rendered by default.

465 changes: 465 additions & 0 deletions data/patches/u53_patch2.csv

Large diffs are not rendered by default.

611 changes: 611 additions & 0 deletions data/patches/u53_patch3.csv

Large diffs are not rendered by default.

1,001 changes: 1,001 additions & 0 deletions data/patches/u58_patch1.csv

Large diffs are not rendered by default.

15 changes: 8 additions & 7 deletions scripts/comparables.sh
@@ -1,7 +1,8 @@
-wget --no-check-certificate --timestamping --recursive --level=1 https://www.val-morin.ca/reglementation_formulaires/reglementation_urbanisme.php
-wget --no-check-certificate --timestamping --recursive --level=1 https://www.valdavid.com/services-aux-citoyens/urbanisme-et-permis/reglements-durbanisme/
-wget --no-check-certificate --timestamping --recursive --level=1 https://www.piedmont.ca/fr/administration-municipale/reglements-municipaux/
-wget --no-check-certificate --timestamping --recursive --level=1 https://www.vss.ca/services-aux-citoyens/services/reglementation-durbanisme
-wget --no-check-certificate --timestamping --recursive --level=1 https://www.sadl.qc.ca/reglements-urbanisme/
-wget --no-check-certificate --timestamping --recursive --level=1 https://www.morinheights.com/Reglements
-wget --no-check-certificate --timestamping --recursive --level=1 https://lacmasson.com/services-aux-citoyens/plan-durbanisme
+wget --accept pdf --no-check-certificate --timestamping --recursive --level=1 https://vsadm.ca/citoyens/reglementation/reglementation-durbanisme/
+wget --accept pdf --no-check-certificate --timestamping --recursive --level=1 https://www.val-morin.ca/reglementation_formulaires/reglementation_urbanisme.php
+wget --accept pdf --no-check-certificate --timestamping --recursive --level=1 https://www.valdavid.com/services-aux-citoyens/urbanisme-et-permis/reglements-durbanisme/
+wget --accept pdf --no-check-certificate --timestamping --recursive --level=1 https://www.piedmont.ca/fr/administration-municipale/reglements-municipaux/
+wget --accept pdf --no-check-certificate --timestamping --recursive --level=1 https://www.vss.ca/services-aux-citoyens/services/reglementation-durbanisme
+wget --accept pdf --no-check-certificate --timestamping --recursive --level=1 https://www.sadl.qc.ca/reglements-urbanisme/
+wget --accept pdf --no-check-certificate --timestamping --recursive --level=1 https://www.morinheights.com/Reglements
+wget --accept pdf --no-check-certificate --timestamping --recursive --level=1 https://lacmasson.com/services-aux-citoyens/plan-durbanisme
220 changes: 0 additions & 220 deletions scripts/make_funsd_data.py

This file was deleted.

4 changes: 0 additions & 4 deletions scripts/problem_pages.sh

This file was deleted.

44 changes: 0 additions & 44 deletions scripts/retokenize.py

This file was deleted.
