invalid values #229

Open
lwaldron opened this issue Sep 4, 2023 · 6 comments
@lwaldron
Member

lwaldron commented Sep 4, 2023

The following line in bugphyzzExports is identifying invalid values and dropping them. @sdgamboa please raise such curation issues here and discuss whether they should be resolved by correcting the invalid values, adding to the allowed vocabulary, or continuing to drop these values. For some, dropping certainly does seem like the right choice for ASR, but for others (like aerophilicity and shapes) I'm not so sure.

https://github.com/waldronlab/bugphyzzExports/blob/a9fc18914cb3b1d9ea3a3d1c0121ccac5c8d482a/inst/scripts/export_bugphyzz.R#L126

[1] "Invalid values for aerophilicity: "
# A tibble: 3 × 2
  Attribute_group Attribute         
  <chr>           <chr>             
1 aerophilicity   facultative aerobe
2 aerophilicity   microaerotolerant 
3 aerophilicity   positive          
[1] "Invalid values for biosafety level: "
# A tibble: 6 × 2
  Attribute_group Attribute                                           
  <chr>           <chr>                                               
1 biosafety level "biosafety level Risk group (German classification)"
2 biosafety level "biosafety level 11o58'14.4\\\""                    
3 biosafety level "biosafety level Germany"                           
4 biosafety level "biosafety level 1+"                                
5 biosafety level "biosafety level 3**"                               
6 biosafety level "biosafety level L1"                                
[1] "Invalid values for disease association: "
# A tibble: 13 × 2
   Attribute_group     Attribute                                      
   <chr>               <chr>                                          
 1 disease association caries                                         
 2 disease association periodontal disorder                           
 3 disease association Infection caused by Escherichia coli (disorder)
 4 disease association Endocarditis                                   
 5 disease association Meningitis                                     
 6 disease association Periodontal Disorder                           
 7 disease association Infection                                      
 8 disease association arthritis                                      
 9 disease association meningitis septicemia                          
10 disease association septicemia arthritis                           
11 disease association Fever                                          
12 disease association urlnary tract infection                        
13 disease association Tetnus                                         
[1] "Invalid values for growth medium: "
# A tibble: 2,191 × 2
   Attribute_group Attribute                                                                                  
   <chr>           <chr>                                                                                      
 1 growth medium   NUTRIENT AGAR (DSMZ Medium 1)                                                              
 2 growth medium   Marine agar (MA)                                                                           
 3 growth medium   R2A MEDIUM (DSMZ Medium 830)                                                               
 4 growth medium   ACETIVIBRIO MEDIUM (DSMZ Medium 122)                                                       
 5 growth medium   Zobell marine agar (ZMA)                                                                   
 6 growth medium   MEDIUM 1 - for Acetobacter, Azotobacter, Gluconobacter, Gluconacetobacter, Mesorhizodium c
 7 growth medium   MEDIUM 85 - for Abiotrophia                                                                
 8 growth medium   GS2 agar plates                                                                            
 9 growth medium   TRYPTICASE SOY YEAST EXTRACT MEDIUM (DSMZ Medium 92)                                       
10 growth medium   MLO agar                                                                                   
# ℹ 2,181 more rows
# ℹ Use `print(n = ...)` to see more rows
[1] "Invalid values for shape: "
# A tibble: 20 × 2
   Attribute_group Attribute         
   <chr>           <chr>             
 1 shape           square            
 2 shape           vibriod cell      
 3 shape           rod-shaped        
 4 shape           coccus-shaped     
 5 shape           filament-shaped   
 6 shape           ellipsoidal       
 7 shape           pleomorphic-shaped
 8 shape           ovoid-shaped      
 9 shape           oval-shaped       
10 shape           other             
11 shape           sphere-shaped     
12 shape           spiral-shaped     
13 shape           curved-shaped     
14 shape           helical-shaped    
15 shape           vibrio-shaped     
16 shape           ring-shaped       
17 shape           spore-shaped      
18 shape           crescent-shaped   
19 shape           star-shaped       
20 shape           diplococcus-shaped
> 
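
For readers following along, here is a minimal sketch of the kind of check that could produce output like the above: compare each value with an allowed vocabulary for its attribute group, report the misses, and keep the rest. This is not the code at the linked line of export_bugphyzz.R. The vocabulary, the toy data, and the names `allowed` and `annotations` are assumptions made up for illustration.

```r
# Hypothetical allowed vocabulary, keyed by attribute group.
# (Illustrative only; not the contents of extdata/attributes.tsv.)
allowed <- list(
  aerophilicity = c("aerobic", "anaerobic", "facultatively anaerobic"),
  shape = c("rod", "coccus", "spiral")
)

# Toy annotations (illustrative only, not bugphyzz data).
annotations <- data.frame(
  Attribute_group = c("aerophilicity", "aerophilicity", "shape", "shape"),
  Attribute = c("aerobic", "facultative aerobe", "rod", "rod-shaped"),
  stringsAsFactors = FALSE
)

# TRUE where the value belongs to the vocabulary of its own group.
# A group with no vocabulary entry yields FALSE for all of its values.
is_valid <- unname(mapply(
  function(group, value) value %in% allowed[[group]],
  annotations$Attribute_group,
  annotations$Attribute
))

# Distinct invalid values to report, and the rows that are kept.
invalid <- unique(annotations[!is_valid, ])
valid <- annotations[is_valid, ]

print(invalid)
```

Whether a reported value should be corrected, added to the vocabulary, or dropped remains a curation decision; the sketch only separates the two sets.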
@sdgamboa
Contributor

sdgamboa commented Sep 4, 2023

I think these come from the output of bacdiveR. @jwokaty, I've been using this spreadsheet; is there a newer version? The "biosafety level" values look like the result of incorrect parsing. I'll add the remaining values to the extdata/attributes.tsv file.

@jwokaty
Contributor

jwokaty commented Sep 12, 2023

@sdgamboa I've created a new spreadsheet, and the biosafety level, country, and geographic location values appear to be formatted correctly. However, I have not yet replaced the BacDive sheet because I wanted to give you the opportunity to look at it first: https://docs.google.com/spreadsheets/d/1P4Ic6-N9GVXcX1CdfoamFt6eozfHqt-sxfIRTBvYHWk/edit?usp=sharing. If it looks good, I'd like to upload it as a new version of the BacDive document.

@sdgamboa
Contributor

sdgamboa commented Sep 20, 2023

@jwokaty, thanks! Values for biosafety level seem fine now, and I no longer get 'X' columns when parsing the file. I added the URL to this code:

bugphyzz/R/bacdive.R

Lines 21 to 29 in ed8b40f

.importBacDiveExcel <- function(verbose = FALSE) {
    if (verbose)
        message('Importing BacDive...')
    # url <- 'https://docs.google.com/spreadsheets/d/1smQTi1IKt4wSGTrGTW25I6u47M5txZkq/export?format=csv'
    url <- 'https://docs.google.com/spreadsheets/d/1P4Ic6-N9GVXcX1CdfoamFt6eozfHqt-sxfIRTBvYHWk/export?format=csv'
    # bacdive_data <- .cleanBD(utils::read.csv(url))
    bacdive <- utils::read.csv(url)
    colnames(bacdive) <- tolower(colnames(bacdive))
    return(bacdive)
}
Please let me know if a new URL is needed or if you overwrite the previous spreadsheet.

library(bugphyzz)
bl <- physiologies('biosafety level')[[1]]
#> Finished biosafety level.
#> Warning: Missing columns in biosafety level. Missing columns are: Genome_ID,
#> Accession_ID
unique(bl$Attribute)
#> [1] "biosafety level 1"   "biosafety level 2"   "biosafety level 3"  
#> [4] "biosafety level 1+"  "biosafety level 3**" "biosafety level L1"

Created on 2023-09-20 with reprex v2.0.2
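
The reprex above still lists "biosafety level 1+", "biosafety level 3**", and "biosafety level L1". A minimal sketch of how such leftovers could be flagged, assuming the expected form is "biosafety level" followed by a single digit from 1 to 4 (that pattern is an assumption, not something defined in bugphyzz):

```r
# Values copied from the reprex output above.
values <- c(
  "biosafety level 1", "biosafety level 2", "biosafety level 3",
  "biosafety level 1+", "biosafety level 3**", "biosafety level L1"
)

# Assumed canonical form: "biosafety level" plus one digit, 1 through 4.
pattern <- "^biosafety level [1-4]$"

# TRUE for values that match the assumed canonical form exactly.
is_clean <- grepl(pattern, values)

# Values that would still need curation under this assumption.
flagged <- values[!is_clean]
print(flagged)
```

Whether the flagged values should be mapped to a plain level or dropped is left open here, as in the discussion above.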

@jwokaty
Contributor

jwokaty commented Sep 25, 2023

@sdgamboa I'm glad that it's working better. I think we should keep using the original URL so that we can make use of Google Sheets versioning. It only keeps 30 days of version history, but it lets us upload a new version without changing the URL in bugphyzz.
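
One way to keep the URL stable in code is to store only the sheet ID and build the CSV export URL from it. This is a sketch, not what bugphyzz does: the helper name `sheet_csv_url` is hypothetical, and the URL pattern is copied from the snippet quoted earlier in this thread.

```r
# Hypothetical helper: build the CSV export URL for a Google Sheet ID,
# using the same URL pattern as the .importBacDiveExcel snippet above.
sheet_csv_url <- function(sheet_id) {
  sprintf(
    "https://docs.google.com/spreadsheets/d/%s/export?format=csv",
    sheet_id
  )
}

# ID of the original spreadsheet, taken from the commented-out URL above.
original_id <- "1smQTi1IKt4wSGTrGTW25I6u47M5txZkq"

url <- sheet_csv_url(original_id)
print(url)
```

If new data is uploaded as a new version of the same sheet, the ID and therefore the URL stay the same, which is the point made above.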

@sdgamboa
Contributor

@jwokaty, agreed. I'll switch back to the original URL when the spreadsheet gets updated.

@jwokaty
Contributor

jwokaty commented Sep 26, 2023

I've updated the Google Sheet!
