Statistics ideas for master's thesis #63
Comments
I think it would be interesting to know:
All the things I proposed kinda require to also have an understanding of the API of the package, not just the metadata. I'm not sure that's within your scope. But it can be done statically. |
Very interesting! Thank you, @hasufell! I wonder what would be a good way of determining the API of packages. 🤔 Could GHCi's |
@jonkri at the Cabal project, we are looking into API checking to ensure PVP compliance, based on print-api (https://github.com/Kleidukos/print-api). This package is in early development, so be warned :-) There are downsides to it (but they're probably inherent to any tool based on the GHC API), which you can read about here: haskell/cabal#10259
@ulysses4ever: Thanks for letting me know about print-api! Since I'm interested in analyzing API changes over time, I wonder how far back print-api and Cabal could go. For example, do you think it would be possible to use modern Cabal to build old packages such as OpenGL-2.1 from 2006 (assuming C headers could be provided for FFI)?
I'm not sure what “OpenGL:setup.Cabal>=1.0 && <1.25” means. Is that a constraint on the version of Cabal? |
Yes, that's a constraint on the version of Cabal (the library) used to build the setup script (Setup.hs). It isn't specified in the OpenGL .cabal file, so it defaults to <1.25. You can use a newer cabal-the-tool to build packages whose setup script requires an older Cabal, but only up to a point (depending on GHC). If you use a newer index-state (probably because of the OpenGL revision, but I'm not sure how exactly), you get a better error that includes this line:
So that package is past the cabal-install+GHC → Cabal compatibility window. The compatibility window is determined by a common lower bound of 1.20 plus a bound based on the GHC version you are using.
So you might be able to build the package by using GHC<8.2.
There is some prior art on some statistics in this old IFL paper. The field is sometimes called "empirical software engineering" or more specifically "mining software repositories" -- would be nice to have continued work in this regard: https://ifl2014.github.io/submissions/ifl2014_submission_14.pdf There is also a paper on stackage which is interesting as well: https://arxiv.org/abs/2310.10887 |
@jonkri software archeology is hard in general, and Haskell doesn't make it much easier. For one particular package, you could probably put in some effort and build it from way back: you'll probably have to set up older GHCs, as noted above, and for some of those you'll need an older GLIBC, and, for that, perhaps, an older OS altogether. Having native dependencies (like with OpenGL) makes it harder. Finally, doing it at scale (e.g. tens or hundreds or thousands of packages) --- I doubt it. Again, this is a hard problem in general. Depending on your goals, you may be better off picking a more syntactic approach that wouldn't require you to compile code before you can analyze it.
Thanks, @gbaz and @ulysses4ever! In order to identify API changes, do you think it would make sense to crawl the Haddock pages on Hackage, or Hoogle databases, and identify the package APIs from there? 🤔 I'm currently running the following naive script to collect print-api types for all LTS 22.40 packages (including previous versions) that build successfully with GHC 9.6.6 from a
#!/bin/bash
# Usage: pass the package name as $1 and its version as $2.
# Expects the package tarball as ./package.tar.gz and a writable /out directory.
set -e
echo "Fetching upload time for $1-$2..."
curl -s "https://hackage.haskell.org/package/$1-$2/upload-time" > "/out/$1-$2-upload-time"
# Append a trailing newline if the file lacks one.
sed -e '$a\' -i "/out/$1-$2-upload-time"
upload_time=$(cat "/out/$1-$2-upload-time")
echo "Downloading Cabal index for $upload_time..."
# Pin the package index to its state at upload time.
cabal update "hackage.haskell.org,$upload_time"
echo "Extracting $1-$2..."
mkdir /package
tar -C /package -f package.tar.gz --strip-components=1 -xz
# Use the absolute path, matching the mkdir above.
cd /package
echo "Building $1-$2..."
cabal build --write-ghc-environment-files=always
echo "Extracting API from $1-$2..."
print-api -p "$1" > "/out/$1-$2-api"
So far I've basically gotten through all packages starting with an upper case letter and gotten around 300 print-api specifications. Edit: Mentioned Hoogle databases.
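The script above takes the package name and the version as separate arguments. If the list of packages to process is written as `name-version` identifiers (the same form the upload-time URL uses), the split could be sketched as below. The helper names `pkg_name` and `pkg_version` are hypothetical, and the sketch assumes the version is everything after the last hyphen, since package names may themselves contain hyphens.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original script): split a
# "name-version" identifier at its last hyphen.

# Package name: drop the shortest suffix that starts with a hyphen.
pkg_name() { printf '%s\n' "${1%-*}"; }

# Version: drop the longest prefix that ends with a hyphen.
pkg_version() { printf '%s\n' "${1##*-}"; }

# Example with an illustrative identifier.
pkg_name "OpenGL-2.1"      # prints: OpenGL
pkg_version "OpenGL-2.1"   # prints: 2.1
```

The two results can then be passed on as the first and second argument of the collection script.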
Looking at the Hackage package index now (in parallel to what's being discussed above). Is there anything from the Cabal metadata that you would be interested in knowing? For example, would it be interesting to know to what extent (upper or lower) bounds are used in
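As a rough illustration of how the bounds question might be approached, the sketch below classifies a single version-range string by whether it contains any upper bound. The function name `has_upper_bound` is hypothetical, and this is a string-level heuristic only: it treats `<`, `<=`, `==` and `^>=` (which implies an upper bound) as upper bounds and ignores the structure of `||` disjunctions, so a real analysis would more likely parse the ranges with the Cabal library.

```shell
#!/bin/bash
# Hypothetical heuristic: succeed (exit status 0) if a Cabal version-range
# string mentions an upper bound anywhere. It does not parse the range, so
# disjunctions such as "<1 || >=2" are misclassified as bounded.
has_upper_bound() {
  case "$1" in
    *"<"*|*"^>="*|*"=="*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with illustrative ranges.
for range in ">=4.7 && <5" ">=4.7" "^>=1.2.3"; do
  if has_upper_bound "$range"; then
    echo "$range: has an upper bound"
  else
    echo "$range: no upper bound"
  fi
done
```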
@gbaz: Thanks again for the papers, they were an interesting read! Regarding the first paper, did you have anything specific in mind when you said it would be nice to have continued work in this regard? |
Perhaps this isn't what you meant, but I'm wondering if work related to breakages really should be seen as waste. I'm thinking that packages breaking to some extent can be seen as a natural evolution of a healthy and innovative package ecosystem, and that it's a balance. Either way, I guess it could be useful to have an estimate of the time it takes to fix a breakage. I guess the most straightforward way to measure this would be to measure the time it takes until a package which fixes the breakage is released.
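The measurement suggested above comes down to the difference between two upload times. A minimal sketch, assuming GNU `date` (for the `-d` option) and ISO 8601 UTC timestamps: the function name `days_between` is hypothetical and the timestamps in the example are made up for illustration, not real Hackage data.

```shell
#!/bin/bash
# Hypothetical helper: whole days elapsed between two ISO 8601 UTC timestamps.
# Assumes GNU date; BSD/macOS date needs different options.
days_between() {
  local start end
  start=$(date -u -d "$1" +%s)   # seconds since the epoch for the first time
  end=$(date -u -d "$2" +%s)     # seconds since the epoch for the second time
  echo $(( (end - start) / 86400 ))
}

# Example with illustrative timestamps: the breaking release and the
# release that adapts to it.
days_between "2024-01-01T00:00:00Z" "2024-01-11T12:00:00Z"   # prints: 10
```

Integer division truncates partial days, which may or may not be the right granularity for a statistical model.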
Here are the research questions I've considered so far:
Edit: “Broken” replaced with “outdated” and “fix” replaced with “adopt to”. |
I don't like the use of "broken" in this context (and "fix"). That implies that the downstream developer made some mistake. They didn't; they cannot predict the future. Use "outdated" and "update/adopt":
I don't think it implies that the downstream developer made some mistake. If I'm hit by a car whose driver violated the speed limit and my leg is broken as a result, that doesn't imply I made a mistake. It implies that something damaged me indeed, and the task may lie with me to improve the situation, but not through any fault of mine.
Thanks, @phadej! I updated the questions. Here's another question:
For example, I'm thinking it could be useful if package maintainers could be notified when something breaks. |
That's an interesting example.
That's a related language issue. Some consider it blame removal (the car driver was just sitting in a car), removing agency from the person behind the steering wheel (of a car). Google about the topic. While I agree that "Pedestrian was killed by a car" is acceptable in an informal discussion, it makes a bad news headline. Similarly, if someone is writing a thesis, they should take care to use as good language as they can. So, please don't imply any extra blame on downstream packages that have bounds according to the agreed version policy. IMHO a better approach is to open a ticket with positive language, "Add support for GHC-9.10" etc. That's a feature request, not a bug report.
@hasufell: What's the 5th version component? Do you mean version tags, perhaps? If so, they are not supported anymore. |
I don't know. That's the question. PVP does not specify it. See the spec:
It is not limited to 4 components. |
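To make the point about component counts concrete: the PVP names A.B the major version and C the minor version, and allows any number of further components. A minimal sketch of pulling these apart for a version of any length follows; the function names `pvp_major` and `component_count` are hypothetical, and the version string is illustrative.

```shell
#!/bin/bash
# Hypothetical helpers for dotted version strings of arbitrary length.

# Major version in the PVP sense: the first two components (A.B).
pvp_major() { printf '%s\n' "$1" | cut -d. -f1,2; }

# Number of dot-separated components in the version.
component_count() { printf '%s\n' "$1" | awk -F. '{ print NF }'; }

# Example with an illustrative five-component version.
pvp_major "1.2.3.4.5"         # prints: 1.2
component_count "1.2.3.4.5"   # prints: 5
```

Counting components across all uploaded versions would show how often a fifth (or later) component appears in practice.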
I'm about to do a master's thesis in Software Engineering. I would like to apply (Bayesian) statistics and, ideally, conduct some kind of experiment. I posted a message on Haskell-Cafe about it yesterday. I have also asked the Hackage administrator to see if I could have access to the Hackage metadata.
I was wondering if you have any suggestions for statistical questions that I could look into that would be of interest from a PVP point of view, for example some kind of analysis related to dependencies or breakages.
Thanks!