Training not providing enough matches #1077
I am having a similar issue with record linkage: the training session gives mostly distinct pairs and very few matches. Problem: in version 2.0.17, the labelling gives a lot of pairs (>30) that are obvious non-links and only 1-2 pairs that could be true links. There are about 30k records in both data sets. The features I use are:
I manually inspected some of the records: there are links to be found. I had used dedupe before on similar data and did not expect this. Thus, I tried out different versions, and at least in version 2.0.11 the labelling works much better (i.e., many more pairs that are likely to be true links) with the same data. Environment: Ubuntu 22.04. The project uses the following conda packages:
# Name  Version  Build  Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
affinegap 1.12 pypi_0 pypi
argon2-cffi 21.3.0 pyhd8ed1ab_0 conda-forge
argon2-cffi-bindings 21.2.0 py38h0a891b7_2 conda-forge
asttokens 2.0.5 pyhd8ed1ab_0 conda-forge
attrs 21.4.0 pyhd8ed1ab_0 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 py_2 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
beautifulsoup4 4.11.1 pyha770c72_0 conda-forge
blas 1.0 mkl
bleach 5.0.1 pyhd8ed1ab_0 conda-forge
bottleneck 1.3.5 py38h7deecbd_0
brotli 1.0.9 he6710b0_2
brotlipy 0.7.0 py38h27cfd23_1003
btrees 4.10.0 pypi_0 pypi
ca-certificates 2022.4.26 h06a4308_0
categorical-distance 1.9 pypi_0 pypi
certifi 2022.6.15 py38h06a4308_0
cffi 1.15.0 py38hd667e15_1
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.0.4 py38h06a4308_0
cryptography 37.0.1 py38h9ce1e76_0
cycler 0.11.0 pyhd3eb1b0_0
datetime-distance 0.1.3 pypi_0 pypi
dbus 1.13.18 hb2f20db_0
debugpy 1.6.0 py38hfa26641_0 conda-forge
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
dedupe 2.0.17 pypi_0 pypi
dedupe-variable-datetime 0.1.5 pypi_0 pypi
defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge
doublemetaphone 1.1 pypi_0 pypi
entrypoints 0.4 pyhd8ed1ab_0 conda-forge
et_xmlfile 1.1.0 py38h06a4308_0
executing 0.8.3 pyhd8ed1ab_0 conda-forge
expat 2.4.4 h295c915_0
fastcluster 1.2.6 pypi_0 pypi
flit-core 3.7.1 pyhd8ed1ab_0 conda-forge
fontconfig 2.13.1 h6c09931_0
fonttools 4.25.0 pyhd3eb1b0_0
freetype 2.11.0 h70c0345_0
future 0.18.2 pypi_0 pypi
giflib 5.2.1 h7b6447c_0
glib 2.69.1 h4ff587b_1
gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
haversine 2.6.0 pypi_0 pypi
highered 0.2.1 pypi_0 pypi
icu 58.2 he6710b0_3
idna 3.3 pyhd3eb1b0_0
importlib-metadata 4.11.4 py38h578d9bd_0 conda-forge
importlib_resources 5.8.0 pyhd8ed1ab_0 conda-forge
intel-openmp 2021.4.0 h06a4308_3561
ipykernel 6.15.1 pyh210e3f2_0 conda-forge
ipython 8.4.0 py38h578d9bd_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jedi 0.18.1 py38h578d9bd_1 conda-forge
jinja2 3.1.2 pyhd8ed1ab_1 conda-forge
joblib 1.1.0 pyhd3eb1b0_0
jpeg 9e h7f8727e_0
jsonschema 4.7.2 pyhd8ed1ab_0 conda-forge
jupyter_client 7.0.6 pyhd8ed1ab_0 conda-forge
jupyter_core 4.10.0 py38h578d9bd_0 conda-forge
jupyterlab_pygments 0.2.2 pyhd8ed1ab_0 conda-forge
kiwisolver 1.4.2 py38h295c915_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
levenshtein-search 1.4.5 pypi_0 pypi
libffi 3.3 he6710b0_2
libgcc-ng 11.2.0 h1234567_1
libgfortran-ng 7.5.0 ha8ba4b0_17
libgfortran4 7.5.0 ha8ba4b0_17
libgomp 11.2.0 h1234567_1
libpng 1.6.37 hbc83047_0
libsodium 1.0.18 h36c2ea0_1 conda-forge
libstdcxx-ng 11.2.0 h1234567_1
libtiff 4.2.0 h2818925_1
libuuid 1.0.3 h7f8727e_2
libwebp 1.2.2 h55f646e_0
libwebp-base 1.2.2 h7f8727e_0
libxcb 1.15 h7f8727e_0
libxml2 2.9.14 h74e7548_0
libxslt 1.1.35 h4e12654_0
lxml 4.9.1 py38h1edc446_0
lz4-c 1.9.3 h295c915_1
markupsafe 2.1.1 py38h0a891b7_1 conda-forge
matplotlib 3.5.1 py38h06a4308_1
matplotlib-base 3.5.1 py38ha18d171_1
matplotlib-inline 0.1.3 pyhd8ed1ab_0 conda-forge
mistune 0.8.4 py38h497a2fe_1005 conda-forge
mkl 2021.4.0 h06a4308_640
mkl-service 2.4.0 py38h7f8727e_0
mkl_fft 1.3.1 py38hd3c417c_0
mkl_random 1.2.2 py38h51133e4_0
munkres 1.1.4 py_0
nbclient 0.6.6 pyhd8ed1ab_0 conda-forge
nbconvert 6.5.0 pyhd8ed1ab_0 conda-forge
nbconvert-core 6.5.0 pyhd8ed1ab_0 conda-forge
nbconvert-pandoc 6.5.0 pyhd8ed1ab_0 conda-forge
nbformat 5.4.0 pyhd8ed1ab_0 conda-forge
ncurses 6.3 h5eee18b_3
nest-asyncio 1.5.5 pyhd8ed1ab_0 conda-forge
nltk 3.7 pyhd3eb1b0_0
notebook 6.4.12 pyha770c72_0 conda-forge
numexpr 2.8.3 py38h807cd23_0
numpy 1.22.3 py38he7a7128_0
numpy-base 1.22.3 py38hf524024_0
openpyxl 3.0.10 py38h5eee18b_0
openssl 1.1.1q h7f8727e_0
packaging 21.3 pyhd8ed1ab_0 conda-forge
pandas 1.4.3 py38h6a678d5_0
pandoc 2.18 ha770c72_0 conda-forge
pandocfilters 1.5.0 pyhd8ed1ab_0 conda-forge
parso 0.8.3 pyhd8ed1ab_0 conda-forge
pcre 8.45 h295c915_0
persistent 4.9.0 pypi_0 pypi
pexpect 4.8.0 pyh9f0ad1d_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 9.2.0 py38hace64e9_1
pip 22.1.2 py38h06a4308_0
prometheus_client 0.14.1 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.30 pyha770c72_0 conda-forge
psutil 5.9.1 py38h0a891b7_0 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pygments 2.12.0 pyhd8ed1ab_0 conda-forge
pyhacrf-datamade 0.2.6 pypi_0 pypi
pylbfgs 0.2.0.14 pypi_0 pypi
pyopenssl 22.0.0 pyhd3eb1b0_0
pyparsing 3.0.9 pyhd8ed1ab_0 conda-forge
pyqt 5.9.2 py38h05f1152_4
pyrsistent 0.18.1 py38h0a891b7_1 conda-forge
pysocks 1.7.1 py38h06a4308_0
python 3.8.13 h12debd9_0
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-fastjsonschema 2.15.3 pyhd8ed1ab_0 conda-forge
python_abi 3.8 2_cp38 conda-forge
pytz 2022.1 py38h06a4308_0
pyzmq 19.0.2 py38ha71036d_2 conda-forge
qt 5.9.7 h5867ecd_1
readline 8.1.2 h7f8727e_1
regex 2022.3.15 py38h7f8727e_0
requests 2.28.1 py38h06a4308_0
rlr 2.4.6 pypi_0 pypi
scikit-learn 1.1.2 pypi_0 pypi
scipy 1.7.3 py38hc147768_0
send2trash 1.8.0 pyhd8ed1ab_0 conda-forge
setuptools 61.2.0 py38h06a4308_0
simplecosine 1.2 pypi_0 pypi
sip 4.19.13 py38h295c915_0
six 1.16.0 pyh6c4a22f_0 conda-forge
soupsieve 2.3.1 pyhd8ed1ab_0 conda-forge
sqlite 3.38.5 hc218d9a_0
stack_data 0.3.0 pyhd8ed1ab_0 conda-forge
terminado 0.15.0 py38h578d9bd_0 conda-forge
threadpoolctl 3.1.0 pypi_0 pypi
tinycss2 1.1.1 pyhd8ed1ab_0 conda-forge
tk 8.6.12 h1ccaba5_0
tornado 6.1 py38h27cfd23_0
tqdm 4.64.0 py38h06a4308_0
traitlets 5.3.0 pyhd8ed1ab_0 conda-forge
typing-extensions 4.3.0 pypi_0 pypi
urllib3 1.26.9 py38h06a4308_0
wcwidth 0.2.5 pyh9f0ad1d_2 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.5 h7f8727e_1
zeromq 4.3.4 h9c3ff4c_1 conda-forge
zipp 3.8.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.12 h7f8727e_2
zope-index 5.2.0 pypi_0 pypi
zope-interface 5.4.0 pypi_0 pypi
zstd 1.5.2 ha4553b6_0
Other observations
there have been a number of changes that could impact the active labeling. if you could isolate this to a specific release, that would be helpful. if you could provide some example data where the current code seems to be performing worse, that would also be very helpful.
@fgregg thanks for the response. Based on previous comments from @f-hafner, I switched from 2.0.17 back to version 2.0.11 with my own fix for the KeyError issue, and the training now seems well balanced between distincts and matches. So the issue with not enough matches must have come from 2.0.12 or later. I had this issue consistently when testing 2.0.17 with three different datasets. Unfortunately, I can't share the data at the moment because it contains PII. I'd be interested to know if anyone has seen this issue with a public dataset. BTW, for those interested, here is my quick fix in core.py in version 2.0.11 for the KeyError that occurs when the dataset size is between ~66,000 and ~92,000 records:
could you narrow it down to a specific version between 2.0.11 and 2.0.17?
Let me do some testing and will let you know...
I should be able to share a sample of my dataset where the issue occurs; I'll let you know.
thank you very much!
@tigerang22, I think I can confirm this. With 2.0.14, I stopped at 100 negative, 1 positive. With 2.0.13, I stopped at 22 negative, 18 positive.
Here is the repo with data and scripts: https://github.com/f-hafner/dedupe_training_example
@fgregg any insight on the issue, and when a future release might have the fix? Thanks
i believe i have addressed this on main, @f-hafner and @tigerang22. can you confirm that it works for your cases? @f-hafner, thank you for the example code; that was very helpful.
@fgregg great! I will give it a try shortly.
@fgregg I encountered a KeyError related to the datetime field type, and it turns out that your commit yesterday doesn't have variables/date_time.py anymore. Are we expected to add that as a custom type now? Please advise.
ugh! this is probably related to #1085
@fgregg I solved the datetime type issue by resetting my virtual environment. I just completed a test of commit aa2b04e against my previous dataset. Unfortunately, the same problem still exists for me: 1 match and close to 100 distinct pairs before I stopped the test. @f-hafner, have you had any luck with your scenarios?
@tigerang22, that’s unfortunate! i…
I haven't tried it out yet, but I will let you know when I have.
Hi @fgregg, @tigerang22, I tried using the GitHub version of dedupe (also on my sample data). It still gave almost only negatives. But I am not sure I got the right version. I installed dedupe as follows:
But then … Details here: https://github.com/f-hafner/dedupe_training_example. What is the correct way to install the GitHub version?
@f-hafner looks like you installed it okay. it's a bit simpler to do it like this:
pip install https://github.com/dedupeio/dedupe/archive/522e7b2147d61fa36d6dee6288df57aee95c4bcc.zip
that's very strange that the performance didn't get better for you. using your test repo, it seemed to be working very well for me. hmmm....
@fgregg is there any chance that this issue is related to the Dedupe and RecordLink DisagreementLearners if you don't already have a training file? In these situations, it seems like a randomly chosen record is used to kick off the learning process and identify pairs of records for you to label. Is it possible that this randomly chosen record just isn't very helpful for learning initial blocking rules and setting up the active learning session? Also, since some initial blocking is occurring, I wonder if with …
i think i have a fix for this in 2.0.23
@fgregg Great! I will give 2.0.23 a shot.
@fgregg I have just tested 2.0.23 and unfortunately the same issue exists. Are there any fine-tuning options that might be affecting this, such as calling deduper.prepare_training(temp_d) with a dynamic sample_size and blocked_proportion instead of using the default values? I also noticed that a previous version such as 2.0.13 would take 4-5 minutes, but 2.0.23 now takes more than 10 minutes to finish the prepare_training call.
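(For reference, a minimal sketch of what tuning that call could look like. sample_size and blocked_proportion are the parameter names mentioned in the comment above; the field definition, the tiny data dict, and the numeric values are hypothetical stand-ins for illustration, not recommendations.)

import dedupe

# Hypothetical variable definition; the real one depends on the data.
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

# Stand-in for the record dict passed to prepare_training; a real run
# would have tens of thousands of records keyed by id.
temp_d = {
    1: {"name": "Acme Corp", "address": "1 Main St"},
    2: {"name": "Acme Corporation", "address": "1 Main Street"},
    3: {"name": "Widget LLC", "address": "9 Oak Ave"},
}

deduper = dedupe.Dedupe(fields)

# Passing explicit values instead of the defaults; a larger sample_size
# generally costs more time and memory in prepare_training.
deduper.prepare_training(
    temp_d,
    sample_size=5000,
    blocked_proportion=0.5,
)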
@tigerang22 can you check to see if the example that @f-hafner posted also doesn't work for you? (it does for me now)
Hello, I am actually struggling with the same problem on version 2.0.23: I tried to go a little further and stopped at 10 positives and 2000 negatives.

My script is based on the pgsql_big_dedupe_example (hope it's up to date :), adapted to use Django 3.2's ORM, as I plan to build an identity manager with Dedupe. My variables are very similar to @f-hafner's: I use distinct birth, last, first, and middle names (all 'String'), a few others (birth date, place, country, ...), and interactions to boost the scores:

dedupe_fields = [

It looks like only one field is eventually used as a predicate (in my case, the logger shows it's the birth date, defined either as DateTime or String), and of course that's not enough to efficiently dedupe my 315k entries. Some entities end up with members whose only common data is the birth date.

Back on 2.0.13, with the same variable definition, I stopped at 47/10 positive, 1000/10 negative, and the following predicates: With this, I end up with ~3000 entities (out of the ~28000 I'm supposed to find).

@fgregg, since it looks like it works for you, could it come from something I obviously missed in the variable definition? Is the training engine more efficient with birth/last/first/... names split into multiple variables, or kept in a single string? (The question may be valid for other variables too.) Or, since I have quite a lot of entries, does the training simply need a lot more samples, both with 2.0.13 and 2.0.23? Thanks for your answers.
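(For illustration, a hedged sketch of what a split-name variable definition could look like in the 2.0.x dict-based API. The field and variable names below are hypothetical, not copied from the setups above, and whether splitting the name into several variables actually helps the learner is exactly the open question here; the alternative would be a single concatenated full-name String variable.)

# Hypothetical dedupe 2.0.x variable definition with the name split into
# separate String fields plus one Interaction between two of them.
dedupe_fields = [
    {"field": "last_name", "variable name": "last_name", "type": "String"},
    {"field": "first_name", "variable name": "first_name", "type": "String"},
    {"field": "middle_name", "type": "String", "has missing": True},
    # Kept as a String here; the 'DateTime' type requires the
    # dedupe-variable-datetime add-on listed in the environment above.
    {"field": "birth_date", "variable name": "birth_date", "type": "String"},
    {"field": "birth_place", "type": "String", "has missing": True},
    {"field": "birth_country", "type": "String", "has missing": True},
    # Interaction variables refer to the 'variable name' values above.
    {"type": "Interaction",
     "interaction variables": ["last_name", "birth_date"]},
]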
I have been using dedupe 2.0.6. Recently I ran into the KeyError issue with a dataset of 78,598 records. After I upgraded to version 2.0.17, the KeyError issue was resolved. However, as I am doing regression testing with 2.0.17 against the previous datasets, I have noticed a dramatic memory increase, from 300 MB to 8-10 GB, and twice as much time as version 2.0.6 during the deduper.prepare_training() call on my Windows machine for a dataset with 121,420 records. (I have a Linux app service in Azure that I had to double in size; I haven't measured the actual memory consumption there yet, so I can't give those metrics at the moment.) The more significant problem is that, although there is better sampling according to this, my training session consistently ends up with 200-300 distincts and only 3-5 matches.
@fgregg, is this problem related solely to sampling, or have other things changed since 2.0.6 that are causing what I am experiencing, i.e. the memory use, performance, and not enough matches during training? I have noticed that the old sampling code that caused the KeyError has been moved out of core.py into convenience.py, and new sampling code is now being used.
Thanks in advance. Love the great work of this project!
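(For readers following along, a minimal sketch of the workflow being described: prepare_training, the interactive labelling session where the distinct/match counts come from, then training and clustering. The field names and toy records are hypothetical, and memory use and runtime will of course look very different at the 100k+ record sizes discussed above.)

import dedupe

# Hypothetical fields; the real definition depends on the dataset.
fields = [
    {"field": "name", "type": "String"},
    {"field": "city", "type": "String", "has missing": True},
]

# Toy stand-in for the ~121k-record dict described above.
data = {
    1: {"name": "Jane Doe", "city": "Oslo"},
    2: {"name": "Jane  Doe", "city": "Oslo"},
    3: {"name": "John Smith", "city": None},
}

deduper = dedupe.Dedupe(fields)

# The call whose memory use and runtime regressed between versions.
deduper.prepare_training(data)

# Interactive console session; this is where the distinct/match
# counts reported in this thread are accumulated ('f' to finish).
dedupe.console_label(deduper)

deduper.train()

# Save the labelled pairs so a later session can skip the cold start.
with open("training.json", "w") as tf:
    deduper.write_training(tf)

# Cluster the records; 0.5 is only an example threshold.
clusters = deduper.partition(data, threshold=0.5)
print(clusters)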