Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent splitting commas in SMARTS expressions #120

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

speakinside
Copy link

The SMARTS expressions with commas in the parameters like "FilterSmartsInclusionList" will be wrongly split right now, since all the commas are split in the current code.
As far as I know, the comma in SMARTS expressions only appears in paired square brackets.
Therefore, I added a new function to prevent splitting the commas in SMARTS expressions based on the above fact.
Hope it can be useful.

@sneumann
Copy link
Member

Hi, cool, thanks for the contribution, I enabled the test run.
It would be great to also have a unit test for that, could you conjure one and add it to the PR ?
Yours, Steffen

@speakinside
Copy link
Author

Sure, one test case added

@sneumann
Copy link
Member

Hi, I can confirm that things compile, the test fails with current master and succeeds with your patch.
Before merging I need to understand if possibly the comma splitting might originally have been a feature, in that you could provide multiple, comma-separated smarts, which would then be AND (or OR ?!) combined when going through the candidates. I can't remember and I am not sure @c-ruttkies is still reading the issues.
Yours, Steffen

@schymane
Copy link
Contributor

I was wondering the same when I saw this...

@speakinside
Copy link
Author

I found the problem when I used the "FilterSmartsInclusionList" parameter to pre-filter the candidates.

the value I used for this parameter was

[$(P(=O)(-O)(-O)-O-[$(c1c(-[CH3])cccc1),$(c1cc(-[CH3])ccc1),$(c1ccc(-[CH3])cc1)]),$(P(=O)(-O)(-O-[CH3])-O-c1ccccc1)], P(=O)(-O-[#6])(-O-[#6])-O-[#6]

So the right split would be [$(P(=O)(-O)(-O)-O-[$(c1c(-[CH3])cccc1),$(c1cc(-[CH3])ccc1),$(c1ccc(-[CH3])cc1)]),$(P(=O)(-O)(-O-[CH3])-O-c1ccccc1)] and P(=O)(-O-[#6])(-O-[#6])-O-[#6].

Any other comma split would result in incorrect SMARTS expressions, causing errors in the constructor function of PreProcessingCandidateSmartsExclusionFilter, and that was the case in the original code.

Also, I suspected the "FilterSmartsExclusionList" parameter would face the same issue.

So, I doubt splitting all commas would be a feature.

@schymane
Copy link
Contributor

OK so part of the problem is that (as you mentioned) commas are valid separators within SMARTS expressions contained within square brackets (as are semi-colons), but the expression you provided seems to require a comma and space combination as a separator for examples that are not in square brackets (which is two separators, not one). Is the space mandatory in the fix?
https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html

Are you able to provide a candidate list/query parameters that you used together with this SMARTS expression, to test this for ourselves?

@speakinside
Copy link
Author

No, the space is not a part of the separator. The original code and mine both trim the substring.

Here is a parameter file I used

PeakListString = 54.3690071105957_27058.451171875;55.05453109741211_74278.6484375;57.06821823120117_48136.6015625;57.06988525390625_948328.5625;59.04935836791992_32604.375;61.01066207885742_28063.423828125;62.989768981933594_17906.287109375;65.03873443603516_30178.607421875;67.05453491210938_52008.12890625;69.07018280029297_61650.3203125;73.04679870605469_230502.609375;77.03880310058594_61446.36328125;79.02117919921875_25791.615234375;79.05416107177734_79356.390625;81.07023620605469_57066.27734375;83.08582305908203_29242.76953125;85.06497955322266_15559.923828125;87.04420471191406_21584.623046875;87.36715698242188_13201.529296875;89.05963897705078_116576.34375;91.05439758300781_280626.96875;91.05768585205078_307414.46875;91.06072235107422_19726.9765625;93.07002258300781_38176.28515625;95.0490951538086_606452.4375;95.08551025390625_62714.35546875;97.10147857666016_17058.701171875;103.05426025390625_15972.8193359375;104.06221771240234_34833.0078125;105.04485321044922_72184.390625;105.07003784179688_65940.0078125;105.07392120361328_64197.87890625;107.0494155883789_62170.5546875;107.08567810058594_18448.951171875;109.10155487060547_15934.921875;115.05462646484375_18037.431640625;117.06576538085938_16135.26171875;117.0698471069336_81131.0625;118.53775787353516_14241.517578125;119.08584594726562_53468.73046875;121.1013412475586_27691.732421875;131.0859375_22171.34375;132.093505859375_32895.22265625;133.0863037109375_73942.46875;135.11685180664062_26869.81640625;136.857177734375_14529.3310546875;141.0098114013672_70642.734375;147.11672973632812_64703.1875;149.04483032226562_76752.3515625;153.07058715820312_27529.05078125;167.05543518066406_109288.6640625;169.0595245361328_12721.1005859375;177.11293029785156_22520.68359375;189.03167724609375_41526.91796875;202.07769775390625_485224.84375;203.0855255126953_182898.609375;210.18572998046875_22085.40625;248.34156799316406_12575.2294921875;265.0437316894531_31593.9765625;265.047607421875_21540.63671875;265.0625_1474505.875;266.0652770996094_15369.767578125;291.11663818359375_15457.03125;299.0623474121094_15320.0947265625;309.27880859375_233498.59375;310.281982421875_28852.89453125;314.92999267578125_18791.0703125;321.1249694824219_2690016.5;322.128173828125_32523.484375;330.9615783691406_17219.33984375;344.97607421875_65474.859375;377.18756103515625_4416049.5;378.1905517578125_87258.2578125;400.98455810546875_208835.609375;401.9857482910156_27875.16796875;402.9632263183594_68886.21875;418.9955139160156_554556.3125;419.03192138671875_25242.572265625;419.99481201171875_90851.1875;433.010986328125_31458.71484375;433.2500915527344_2850659.0;434.2517395019531_69297.953125;454.3619079589844_14023.400390625;459.02337646484375_16652.669921875;489.0550537109375_119971.4609375;489.3118591308594_679491.75;489.40399169921875_22947.23046875;489.4471435546875_65965.1875;490.0562744140625_37209.22265625
MetFragPeakListReader = de.ipbhalle.metfraglib.peaklistreader.FilteredStringTandemMassPeakListReader
MetFragDatabaseType = PubChem
NeutralPrecursorMolecularFormula = C29H45O4P
IonizedPrecursorMass = 489.313446044922
FragmentPeakMatchAbsoluteMassDeviation = 0.001
FragmentPeakMatchRelativeMassDeviation = 5
PrecursorIonMode = 1
IsPositiveIonMode = True
MetFragScoreTypes = FragmenterScore
MetFragScoreWeights = 1
MetFragCandidateWriter = CSV
SampleName = 11665_[C12]29[H1]45[O16]4[P31]_489.313446044922
ResultsPath = compute_dir
MaximumTreeDepth = 2
FilterSmartsInclusionList = [$(P(=O)(-O)(-O)-O-[$(c1c(-[CH3])cccc1),$(c1cc(-[CH3])ccc1),$(c1ccc(-[CH3])cc1)]),$(P(=O)(-O)(-O-[CH3])-O-c1ccccc1)],P(=O)(-O-[#6])(-O-[#6])-O-[#6]
MetFragPreProcessingCandidateFilter = UnconnectedCompoundFilter, SmartsSubstructureInclusionFilter
MetFragPostProcessingCandidateFilter = InChIKeyFilter
NumberThreads = 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants