Statistics ideas for master's thesis #63

Open
jonkri opened this issue Oct 3, 2024 · 18 comments

@jonkri

jonkri commented Oct 3, 2024

I'm about to do a master's thesis in Software Engineering. I would like to apply (Bayesian) statistics and, ideally, conduct some kind of experiment. I posted a message on Haskell-Cafe about it yesterday. I have also asked the Hackage administrator to see if I could have access to the Hackage metadata.

I was wondering if you have any suggestions for statistical questions that I could look into that would be of interest from a PVP point of view, for example some kind of analysis related to dependencies or breakages.

Thanks!

@jonkri changed the title from "PVP statistics ideas for master's thesis" to "Statistics ideas for master's thesis" on Oct 3, 2024
@hasufell
Member

hasufell commented Oct 3, 2024

I think it would be interesting to know:

  • how many maintainers violate the PVP (e.g. missed a major bump despite API-breaking changes)... also mind the corner case of re-exporting other packages' APIs, which is a disaster in its own right
  • how many maintainers do lazy major bumps although there was no API breakage (I'm looking at you @michaelpj 😁)
  • come up with some vague estimations about man-hours spent on updating one's package for one dependency (major bump) and then calculate the total amount of man-hours wasted in the entire ecosystem per, say, year (bonus points if you include GHC)
  • what are the most common bump patterns (for both major and minor)
  • what do people use the 4th and 5th etc. version components for

All the things I proposed kind of require an understanding of the package's API as well, not just the metadata.

I'm not sure that's within your scope. But it can be done statically.
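
As a starting point for the "bump patterns" idea above, here is a minimal sketch of classifying the step between two version numbers under the PVP's A.B.C scheme (A.B is the major version, C the minor version). The function names and the release history are hypothetical, and padding missing components with zeros is a simplification of this sketch rather than Cabal's actual ordering.

```python
from collections import Counter


def parse_version(text):
    """Turn a dotted version string such as '1.2.3.4' into [1, 2, 3, 4]."""
    return [int(part) for part in text.split(".")]


def classify_bump(old, new):
    """Classify the step from `old` to `new` under the PVP's A.B.C scheme.

    Missing components are padded with 0 for the comparison. That is an
    assumption of this sketch: Cabal itself orders 1.2 before 1.2.0.
    """
    a = parse_version(old)
    b = parse_version(new)
    width = max(len(a), len(b), 3)
    a += [0] * (width - len(a))
    b += [0] * (width - len(b))
    if a == b:
        return "none"
    if b < a:
        return "downgrade"
    if a[:2] != b[:2]:
        return "major"  # A.B changed
    if a[2] != b[2]:
        return "minor"  # C changed
    return "patch"      # only the 4th or a later component changed


# Hypothetical release history, used only to show the tallying step.
history = ["0.1.0.0", "0.1.0.1", "0.1.1.0", "0.2.0.0"]
pattern_counts = Counter(
    classify_bump(x, y) for x, y in zip(history, history[1:])
)
```

This only describes which component moved. Deciding whether a bump was justified still needs the API comparison mentioned above.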

@jonkri
Author

jonkri commented Oct 3, 2024

Very interesting! Thank you, @hasufell!

I wonder what would be a good way of determining the API of packages. 🤔 Could GHCi's :browse/:browse! command suffice, perhaps? Or would I need to dig deeper, perhaps getting into parsing .hi files?

@ulysses4ever

@jonkri at the Cabal project, we are looking into API checking to ensure PVP compliance, based on https://github.com/Kleidukos/print-api . This package is in early development, so be warned :-) There are downsides to it (though they're probably inherent to any tool based on the GHC API), which you can read about here: haskell/cabal#10259

@jonkri
Author

jonkri commented Nov 16, 2024

@ulysses4ever: Thanks for letting me know about print-api!

Since I'm interested in analyzing API changes over time, I wonder how far back print-api and Cabal could go.

For example, do you think it would be possible to use modern Cabal to build old packages such as OpenGL-2.1 from 2006 (assuming C headers could be provided for FFI)?

$ cabal update hackage.haskell.org,2006-11-02T14:21:52Z
$ cabal get OpenGL-2.1
$ cd OpenGL-2.1
$ cabal install
Warning: Requested index-state 2006-11-02T14:21:52Z is newer than
'hackage.haskell.org'! Falling back to older state (2006-11-02T14:21:40Z).
Error: cabal: Could not resolve dependencies:
[__0] trying: OpenGL-2.1 (user goal)
[__1] next goal: OpenGL:setup.Cabal (dependency of OpenGL)
[__1] rejecting: OpenGL:setup.Cabal-3.10.3.0/installed-3.10.3.0 (conflict:
OpenGL => OpenGL:setup.Cabal>=1.0 && <1.25)
[__1] fail (backjumping, conflict set: OpenGL, OpenGL:setup.Cabal)
After searching the rest of the dependency tree exhaustively, these were the
goals I've had most trouble fulfilling: OpenGL, OpenGL:setup.Cabal

I'm not sure what “OpenGL:setup.Cabal>=1.0 && <1.25” means. Is that a constraint on the version of Cabal?

@fgaz
Member

fgaz commented Nov 17, 2024

Yes, that's a constraint on the version of Cabal (the library) used to build the setup script (Setup.hs). It isn't specified in the OpenGL .cabal file, so it defaults to <1.25.

You can use a newer cabal (the tool) to build packages whose setup script requires an older Cabal, up to a point (depending on GHC). If you use a newer index-state (probably because of the OpenGL revision, but I'm not sure how exactly), you get a better error that includes this line:

constraint from minimum version of Cabal used by Setup.hs requires >=3.12

So that package is past the cabal-install+GHC → Cabal compatibility window.

The compatibility window is determined by a common lower bound of 1.20 plus a bound based on the GHC version you are using.

  -- GHC 8.2   needs  Cabal >= 2.0
  -- GHC 8.0   needs  Cabal >= 1.24

So you might be able to build the package by using GHC<8.2.

...or you could try to allow a newer Cabal with --allow-newer=OpenGL.setup:Cabal.
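
The window check described above can be sketched roughly as follows. The table holds only the two rows quoted in this comment plus the common lower bound of 1.20. The function names are hypothetical, and treating a newer GHC as needing at least the largest listed bound is an assumption of the sketch (the real requirement for newer GHCs is higher).

```python
# Rows copied from the comment above. Newer GHC releases are left out on
# purpose because the thread does not list their bounds.
GHC_MIN_CABAL = {
    (8, 0): (1, 24),
    (8, 2): (2, 0),
}
COMMON_LOWER_BOUND = (1, 20)


def min_cabal_for_ghc(ghc):
    """Lowest Cabal library version usable with `ghc`, per the table.

    For a GHC newer than every table row this returns the largest listed
    bound, which underestimates the real requirement.
    """
    known = [cabal for g, cabal in GHC_MIN_CABAL.items() if g <= ghc]
    return max(known + [COMMON_LOWER_BOUND])


def setup_may_build(ghc, setup_cabal_upper):
    """True if the window's lower bound lies below the package's exclusive
    upper bound on the Cabal version used for its setup script."""
    return min_cabal_for_ghc(ghc) < setup_cabal_upper


# OpenGL-2.1 defaults to Cabal < 1.25 for its setup script.
with_ghc_8_0 = setup_may_build((8, 0), (1, 25))
with_ghc_8_2 = setup_may_build((8, 2), (1, 25))
```

Under these assumptions the check passes for GHC 8.0 and fails for GHC 8.2, which matches the suggestion to try GHC<8.2. It does not say whether a matching Cabal release is actually installable.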

@gbaz
Contributor

gbaz commented Nov 17, 2024

There is some prior art on some statistics in this old IFL paper. The field is sometimes called "empirical software engineering" or more specifically "mining software repositories" -- would be nice to have continued work in this regard:

https://ifl2014.github.io/submissions/ifl2014_submission_14.pdf

There is also a paper on stackage which is interesting as well: https://arxiv.org/abs/2310.10887

@ulysses4ever

@jonkri software archeology is hard in general, and Haskell doesn't make it much easier. For one particular package, you could probably put in some effort and build it from way back: you'll probably have to set up older GHCs, as noted above, and for some of those you'll need an older GLIBC, and, for that, perhaps an older OS altogether. Having native dependencies (like with OpenGL) makes it harder. Finally, doing it at scale (e.g. tens, hundreds, or thousands of packages)? I doubt it. Again, this is a hard problem in general. Depending on your goals, you may be better off picking a more syntactic approach that wouldn't require you to compile code before you can analyze it.

@jonkri
Author

jonkri commented Nov 18, 2024

Thanks, @gbaz and @ulysses4ever!

In order to identify API changes, do you think it would make sense to crawl the Haddock pages on Hackage, or Hoogle databases, and identify the package APIs from there? 🤔

I'm currently running the following naive script to collect print-api types for all LTS 22.40 packages (including previous versions) that build successfully with GHC 9.6.6 from a haskell:9.6.6 Docker image. $1 is the package and $2 is the version.

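#!/bin/bash
# Usage: <script> PACKAGE VERSION
set -e

pkg=$1
ver=$2

echo "Fetching upload time for $pkg-$ver..."
# -f makes curl fail on HTTP errors instead of saving the error page.
curl -sf "https://hackage.haskell.org/package/$pkg-$ver/upload-time" > "/out/$pkg-$ver-upload-time"
# Make sure the file ends with a newline.
sed -e '$a\' -i "/out/$pkg-$ver-upload-time"

upload_time=$(cat "/out/$pkg-$ver-upload-time")

echo "Downloading Cabal index for $upload_time..."
# Pin the package index to its state at upload time.
cabal update "hackage.haskell.org,$upload_time"

echo "Extracting $pkg-$ver..."
mkdir /package
# package.tar.gz must already be in the working directory; this script does not fetch it.
tar -C /package -f package.tar.gz --strip-components=1 -xz
cd /package

echo "Building $pkg-$ver..."
cabal build --write-ghc-environment-files=always

echo "Extracting API from $pkg-$ver..."
print-api -p "$pkg" > "/out/$pkg-$ver-api"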

So far I've basically gotten through all packages starting with an upper case letter and gotten around 300 print-api specifications.

Edit: Mentioned Hoogle databases.

@jonkri
Author

jonkri commented Nov 19, 2024

Looking at the Hackage package index now (in parallel to what's being discussed above). Is there anything from the Cabal metadata that you would be interested in knowing? For example, would it be interesting to know to what extent (upper or lower) bounds are used in build-depends?
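
A rough sketch of how the bounds question could be tallied from a build-depends field. This is a text heuristic and not a real Cabal parser: the function name is hypothetical, conditionals and sublibrary syntax are not handled, and splitting on commas breaks on the `^>= { ... }` set syntax.

```python
import re


def bound_kinds(build_depends):
    """Map each dependency name to a (has_lower, has_upper) pair.

    Heuristic assumptions: '^>=' and '==' count as both a lower and an
    upper bound, any other '>' as a lower bound and any '<' as an upper
    bound.
    """
    result = {}
    for entry in build_depends.split(","):
        entry = entry.strip()
        match = re.match(r"([A-Za-z0-9-]+)\s*(.*)", entry, re.S)
        if match is None:
            continue  # empty or unrecognised entry
        name, constraint = match.group(1), match.group(2)
        both = bool(re.search(r"\^>=|==", constraint))
        lower = both or ">" in constraint
        upper = both or "<" in constraint
        result[name] = (lower, upper)
    return result


# Hypothetical field contents, for illustration only.
example = bound_kinds(
    "base >=4.7 && <5, containers, text ^>=2.0, bytestring >=0.10"
)
```

For anything beyond a first estimate, parsing the .cabal files with the Cabal library itself would be more reliable than a regex.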

@jonkri
Author

jonkri commented Nov 19, 2024

@gbaz: Thanks again for the papers, they were an interesting read! Regarding the first paper, did you have anything specific in mind when you said it would be nice to have continued work in this regard?

@jonkri
Author

jonkri commented Nov 20, 2024

@hasufell:

come up with some vague estimations about man-hours spent on updating one's package for one dependency (major bump) and then calculate the total amount of man-hours wasted in the entire ecosystem per, say, year (bonus points if you include GHC)

Perhaps this isn't what you meant, but I wonder whether work related to breakages should really be seen as waste. Some amount of breakage can be seen as part of the natural evolution of a healthy and innovative package ecosystem; it's a balance. Either way, it could be useful to have an estimate of the time it takes to fix a breakage. The most straightforward way to measure this would be the time until a release that fixes the breakage is published.
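
The measurement suggested here could be sketched like this, assuming upload times in the same `YYYY-MM-DDTHH:MM:SSZ` form as the index-state timestamps earlier in the thread. The function names and the two timestamps are hypothetical.

```python
from datetime import datetime, timezone


def parse_upload_time(text):
    """Parse a UTC timestamp such as '2006-11-02T14:21:52Z'."""
    parsed = datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def days_to_adapt(dependency_release, dependent_release):
    """Days between a dependency's breaking release and the first release
    of a dependent package that supports it."""
    delta = parse_upload_time(dependent_release) - parse_upload_time(
        dependency_release
    )
    return delta.total_seconds() / 86400


# Hypothetical timestamps, for illustration only.
lag = days_to_adapt("2024-01-01T00:00:00Z", "2024-01-11T12:00:00Z")
```

Release lag is only a proxy: it includes time in which nobody was working on the update, so it bounds the effort from above without measuring it.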

@jonkri
Author

jonkri commented Nov 20, 2024

Here are the research questions I've considered so far:

  • How frequently are major and minor versions bumped?
  • What are the fourth and fifth version components used for?
  • To what extent, and how, are dependency version bounds (both lower and upper) used?
  • To what extent, and how, are lock files used (checked into version control)? How up to date are the lock files? How often are lock files updated?
  • To what extent, and why, are major versions bumped (unnecessarily, against the PVP) when there are no breaking changes?
  • To what extent, and why, are minor versions bumped (incorrectly, against the PVP) when there are breaking changes?
  • How could a tool for automatically detecting versioning problems (e.g., a Cabal or Stack feature, or a GitHub Actions extension) work?
  • How long do packages remain outdated when there are breaking changes in dependencies?
  • How much work is required to adapt to breaking changes in dependencies?
  • What effect does re-exporting other packages' APIs have on breakages?

Edit: “Broken” replaced with “outdated” and “fix” replaced with “adapt to”.
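
Since the thesis is meant to use Bayesian statistics, the "to what extent" questions above could start from a simple Beta-Binomial model of a proportion, for example the share of minor bumps that contain breaking changes. This is a minimal sketch, assuming each release is an independent trial and a uniform Beta(1, 1) prior. The function name and the counts are hypothetical, not measurements.

```python
def beta_posterior(violations, releases, prior_a=1.0, prior_b=1.0):
    """Posterior Beta(a, b) for a violation rate after observing
    `violations` out of `releases`, with its mean and variance."""
    a = prior_a + violations
    b = prior_b + (releases - violations)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, variance


# Hypothetical counts: 3 violating releases out of 10 inspected.
a, b, mean, variance = beta_posterior(3, 10)
```

A hierarchical model with one rate per package or maintainer would probably suit the real data better, since releases of the same package are unlikely to be independent.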

@phadej
Collaborator

phadej commented Nov 20, 2024

How long are packages remaining broken when there are breaking changes in dependencies?
How much work is required to fix breaking changes in dependencies?

I don't like the use of "broken" in this context (nor "fix"). It implies that the downstream developer made some mistake. They didn't; they cannot predict the future.

Use "outdated" and "update"/"adapt":

How long do packages remain outdated when there is a breaking change in their dependencies?
How much work is required to adapt to a breaking change in dependencies?

@tomjaguarpaw
Member

I don't think it implies that the downstream developer made a mistake. If I'm hit by a car whose driver violated the speed limit and my leg is broken as a result, that doesn't imply I made a mistake. It implies that something did indeed damage me, and the task of improving the situation may lie with me, but not because I was at fault.

@jonkri
Author

jonkri commented Nov 20, 2024

Thanks, @phadej! I updated the questions.

Here's another question:

  • How can the time it takes to adapt to a breaking change be reduced?

For example, I'm thinking it could be useful if package maintainers could be notified when something breaks.

@phadej
Collaborator

phadej commented Nov 20, 2024

I don't think it implies that the downstream developer made a mistake. If I'm hit by a car whose driver violated the speed limit and my leg is broken as a result, that doesn't imply I made a mistake. It implies that something did indeed damage me, and the task of improving the situation may lie with me, but not because I was at fault.

That's an interesting example.

If I'm hit by a car

That's a related language issue. Some consider this phrasing blame removal (as if the driver were just sitting in the car): it removes agency from the person behind the steering wheel. Look the topic up.

While I agree that "Pedestrian was killed by a car" is acceptable in informal discussion, it's a bad news headline. Similarly, someone writing a thesis should take care to use the best language they can.

So, please don't imply any extra blame on downstream packages that have bounds according to the agreed version policy.
If someone opens a ticket against a library saying "Your library is broken: there are restrictive upper bounds, and there is a compilation error if the bounds are relaxed", my knee-jerk reaction may be "It's not broken; I never said that it supports GHC-9.10. In fact, I said that it doesn't (or at least that I don't know), which is why the bounds are there", and I would close the ticket as invalid.

IMHO a better approach is to open a ticket with positive language, such as "Add support for GHC-9.10". That's a feature request, not a bug report.

@jonkri
Author

jonkri commented Nov 20, 2024

  • what do people use the 4th and 5th etc. version components for

@hasufell: What's the 5th version component? Do you mean version tags, perhaps? If so, they are not supported anymore.

@hasufell
Member

What's the 5th version component?

I don't know. That's the question. PVP does not specify it.

See the spec:

A package version number SHOULD have the form A.B.C, and MAY optionally have any number of additional components

It is not limited to 4 components.


7 participants