--------------------------------------------
Notes on nofib programs
--------------------------------------------
Every time I do performance tuning on GHC I re-discover wrinkles in
particular nofib programs. I intend to record these wrinkles herein.
NOTE: With 4.09 we introduced Unicode. That means that (C# x) always allocates,
whereas it didn't before. So allocations go up a bit.
---------------------------------------
Imaginary suite
---------------------------------------
queens
~~~~~~
The comprehension
gen n = [ (q:b) | b <- gen (n-1), q <- [1..nq], safe q 1 b]
has, for each iteration of 'b', a new list [1..nq]. This can be floated
out and hence shared, or fused. It's quite delicate which of the two
happens.
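The difference can be seen in a standalone sketch. (Here 'nq' and 'safe' are illustrative stand-ins, not the exact nofib source.) In the second form the enumeration is allocated once per recursive call rather than once per 'b':

```haskell
-- Illustrative sketch only; 'nq' and 'safe' are stand-in definitions.
nq :: Int
nq = 4

safe :: Int -> Int -> [Int] -> Bool
safe _ _ []     = True
safe q d (p:ps) = q /= p && abs (q - p) /= d && safe q (d + 1) ps

-- As written: a fresh [1..nq] arises for each 'b', unless GHC fuses it away.
gen :: Int -> [[Int]]
gen 0 = [[]]
gen n = [ q : b | b <- gen (n - 1), q <- [1 .. nq], safe q 1 b ]

-- After full laziness: the enumeration is floated out and shared.
genFloated :: Int -> [[Int]]
genFloated 0 = [[]]
genFloated n = [ q : b | b <- genFloated (n - 1), q <- qs, safe q 1 b ]
  where qs = [1 .. nq]

main :: IO ()
main = print (gen nq == genFloated nq && length (gen nq) == 2)
```

Whether GHC actually shares or fuses [1..nq] depends on pass ordering, which is exactly the delicacy described above.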
integrate
~~~~~~~~~
integrate1D is strict in its second argument 'u', but it also passes 'u' to
function 'f'. Hence it now does some reboxing, which pushes up allocation
slightly.
gen_regexps
~~~~~~~~~~~
I found that there were some very bad loss-of-arity cases in PrelShow.
In particular, we had:
showl ""       = showChar '"'
showl ('"':xs) = showString "\\\"" . showl xs
showl (x:xs)   = showLitChar x . showl xs
Trouble is, we get
showl = \xs -> case xs of
                 ...
                 (x:xs) -> let f = showLitChar x
                               g = showl xs
                           in \s -> f (g s)
which is TERRIBLE. We can't spot that showLitChar has arity 2 because
it looks like this:
...other eqns...
showLitChar c = showString ('\\' : asciiTab!!ord c)
notice that the (asciiTab!!ord c) is outside the \s, so GHC can't rewrite it to
showLitChar c = \s -> showString ('\\' : asciiTab!!ord c) s
So I've changed PrelShow.showLitChar to use explicit \s. Even then, showl
doesn't work, because GHC can't see that showl xs can be pushed inside the \s.
So I've put an explicit \s there too.
showl "" s = showChar '"' s
showl ('"':xs) s = showString "\\\"" (showl xs s)
showl (x:xs) s = showLitChar x (showl xs s)
Net result: imaginary/gen_regexps more than halves in allocation!
queens
~~~~~~
If we do
a) some inlining before float-out
b) fold/build fusion before float-out
then queens gets 40% more allocation. Presumably the fusion
prevents sharing.
x2n1
~~~~
Relies quite heavily on specialisations for Num (Complex Double). If
this test suddenly gets worse by a factor of 2 or so, chances are that
specialisation is broken.
---------------------------------------
Spectral suite
---------------------------------------
rewrite
~~~~~~~
It's important to inline p_ident.
There's a very delicate CSE in p_expr
p_expr = seQ q_op [p_term1, p_op, p_term2] ## p_term3
(where p_term1,2,3 are really all just p_term).
This expands into
p_expr s = case p_term1 s of
             Nothing -> p_term3 s
             Just (v,s') -> case p_op s' of ...
                              Nothing -> p_term3 s
Notice that if the bit before (##) fails, we call p_term3 s, and if
CSE notices we can re-use the result of the first call. But that
depends on how much inlining takes place. In particular, the follow-up
calls to p_term3 can get hidden inside join points, which don't "see" the
first call.
cse
~~~
The functions 'go' and 'go1', called from draw, get arity 1, but they
should get arity 2. We need a better arity analysis, a la Gill. 'go' looks
like this:
go_r1og =
\ (ds_aGI :: [GHC.Base.Char]) ->
case ds_aGI of wild_aGJ {
[] -> z_r1oe;
: y_aGN ys_aGO ->
let {
xs7_s1i8 :: GHC.Prim.Int# -> [GHC.Base.Char]
[Str: DmdType]
xs7_s1i8 = go_r1og ys_aGO
} in
\ (m_XWf :: GHC.Prim.Int#) ->
case GHC.Prim.<=# m_XWf 1 of wild1_aSI {
GHC.Base.False ->
GHC.Base.: @ GHC.Base.Char y_aGN (xs7_s1i8 (GHC.Prim.-# m_XWf 1));
GHC.Base.True -> GHC.Base.: @ GHC.Base.Char y_aGN lvl13_r1n0
}
}
Notice the 'let' which stops the lambda moving out.
eliza
~~~~~
In June 2002, GHC 5.04 emitted four successive
NOTE: Simplifier still going after 4 iterations; bailing out.
messages. I suspect that the simplifier is looping somehow.
fibheaps
~~~~~~~~
If you don't inline getChildren, allocation rises by 25%.
hartel/event
~~~~~~~~~~~~
There are functions called f_nand and f_d, which generate tons of
code if you inline them too vigorously. And this can happen because
of a massive result discount.
Moreover if f_d gets inlined too much, you get lots of local lvl_xx
things which make some closures have lots of free variables, which pushes
up allocation.
expert
~~~~~~
In spectral/expert/Search.ask there's a statically visible CSE. Catching this
depends almost entirely on chance, which is a pity.
reptile
~~~~~~~
Performance dominated by (++) and Show.itos'
fish
~~~~
The performance of fish depends crucially on inlining scale_vec2.
It turns out to be right on the edge of GHC's normal threshold size, so
small changes to the compiler can knock it to and fro.
[Sept 08] The new instance-declaration story makes show slightly worse.
Look at Main.showl in the simplified core. You'll see this:
a79_aRZ :: GHC.Types.Int
-> (GHC.Types.Int, GHC.Types.Int, GHC.Types.Int, GHC.Types.Int)
-> GHC.Show.ShowS
[Str: DmdType]
a79_aRZ =
GHC.Show.showsPrec9 @ Int @ Int @ Int @ Int
GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16 GHC.Show.$f16
Rec {
showl_sWK :: [Main.Line_segment] -> String -> String
[Arity 2
Str: DmdType SL]
showl_sWK =
\ (ds_dQJ :: [Main.Line_segment]) (s_alB :: GHC.Base.String) ->
case ds_dQJ of wild_X1d [ALWAYS Just A] {
[] -> GHC.Types.: @ GHC.Types.Char lvl_sYT s_alB;
: x_alD [ALWAYS Just L] xs_alF [ALWAYS Just L] ->
GHC.Base.++
@ GHC.Types.Char
a_s12R
(a79_aRZ a_s1uS x_alD (showl_sWK xs_alF s_alB))
}
end Rec }
Note the non-inlined call to GHC.Show.showsPrec9, which looks like this:
showsPrec9 :: forall a b c d. (Show a, Show b, Show c, Show d) =>
GHC.Types.Int -> (a, b, c, d) -> GHC.Show.ShowS
{- Arity: 4 Strictness: LLLL
Unfolding: (\ @ a b c d
$dShow :: GHC.Show.Show a
$dShow1 :: GHC.Show.Show b
$dShow2 :: GHC.Show.Show c
$dShow3 :: GHC.Show.Show d ->
let {
shows1 :: d -> GHC.Show.ShowS = GHC.Show.shows @ d $dShow3
} in
let {
shows2 :: c -> GHC.Show.ShowS = GHC.Show.shows @ c $dShow2
} in
let {
shows3 :: b -> GHC.Show.ShowS = GHC.Show.shows @ b $dShow1
} in
let {
shows4 :: a -> GHC.Show.ShowS = GHC.Show.shows @ a $dShow
} in
\ ds :: GHC.Types.Int ds1 :: (a, b, c, d) s :: GHC.Base.String ->
case @ GHC.Base.String ds1 of wild { (a79, b, c, d) ->
GHC.Show.show_tuple
(GHC.Types.:
@ GHC.Show.ShowS
(shows4 a79)
(GHC.Types.:
@ GHC.Show.ShowS
(shows3 b)
(GHC.Types.:
@ GHC.Show.ShowS
(shows2 c)
(GHC.Types.:
@ GHC.Show.ShowS
(shows1 d)
(GHC.Types.[] @ GHC.Show.ShowS)))))
s }) -}
We would do better to inline showsPrec9, but it looks too big. Before,
it was inlined regardless by the instance-decl stuff. So perf drops slightly.
integer
~~~~~~~
A good benchmark for beating on big-integer arithmetic.
There is a delicate interaction of fusion and full laziness in the comprehension
integerbench :: (Integer -> Integer -> a)
             -> Integer -> Integer -> Integer
             -> Integer -> Integer -> Integer
             -> IO ()
integerbench op astart astep alim bstart bstep blim = do
  seqlist ([ a `op` b
           | a <- [ astart, astart+astep .. alim ]
           , b <- [ bstart, bstart+bstep .. blim ]])
  return ()
and the analogous one for Int.
Since the inner loop (for b) doesn't depend on a, we could float the
b-list out; but it may fuse first. In GHC 8 (and most previous
versions) this fusion did happen at type Integer, but (accidentally) not for
Int, because an intervening eval got in the way. So the b-enumeration was
floated out, which led to less allocation of Int values.
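A miniature of the two shapes (hypothetical names, not the nofib source): the two definitions are semantically identical, but if the b-enumeration fuses it is rebuilt for every 'a', while the floated form builds it once.

```haskell
-- Hypothetical miniature of the comprehension's two possible shapes.
pairsFused :: [Integer]
pairsFused = [ a * b | a <- [1, 3 .. 9], b <- [2, 4 .. 10] ]

-- After the float: the b-list is one shared thunk across all 'a'.
pairsFloated :: [Integer]
pairsFloated = [ a * b | a <- [1, 3 .. 9], b <- bs ]
  where bs = [2, 4 .. 10]

main :: IO ()
main = print (pairsFused == pairsFloated)
```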
Knights
~~~~~~~
* In knights/KnightHeuristic, we don't find that possibleMoves is strict
(with important knock-on effects) unless we apply rules before floating
out the literal list [A,B,C...].
* Similarly, in f_se (F_Cmp ...) in listcompr (but a smaller effect)
* If we don't inline $wmove, we get an allocation increase of 17%
lambda
~~~~~~
This program shows the cost of the non-eta-expanded lambdas that arise from
a state monad.
mandel2
~~~~~~~
check_perim's several calls to point_colour lead to opportunities for CSE
which may be more or less well taken.
mandel
~~~~~~
Relies heavily on having a specialised version of Complex.magnitude
(:: Complex Double -> Double) available.
Mandel.mandelset has a go-loop inside an enumDeltaToIntegerFB, out of which
4.08.2 managed to float a constant expression, but HEAD did not. I think
this is because the pre-let-floating simplification did too little inlining;
in particular, it did not inline windowToViewport.
multiplier
~~~~~~~~~~
In spectral/multiplier, we have
xor = lift21 forceBit f
  where f :: Bit -> Bit -> Bit
        f 0 0 = 0
        f 0 1 = 1
        f 1 0 = 1
        f 1 1 = 0
Trouble is, f is CPR'd, and that means that instead of returning
the constants I# 0, I# 1, it returns 0,1 and then boxes them.
So allocation goes up. I don't see a way around this.
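In source-level terms the transform looks roughly like this (the worker/wrapper names are illustrative; the real rewrite happens in Core):

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#)

-- Original: each right-hand side can return a shared constant Int,
-- with no allocation per call. (Catch-all added here only for totality.)
f :: Int -> Int -> Int
f 0 0 = 0
f 0 1 = 1
f 1 0 = 1
f _ _ = 0

-- After CPR worker/wrapper (illustrative names): the worker returns a
-- raw Int#, and the wrapper reboxes it with I# on every call.
wf :: Int -> Int -> Int#
wf 0 0 = 0#
wf 0 1 = 1#
wf 1 0 = 1#
wf _ _ = 0#

fCpr :: Int -> Int -> Int
fCpr x y = I# (wf x y)

main :: IO ()
main = print (and [ fCpr a b == f a b | a <- [0, 1], b <- [0, 1] ])
```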
hartel/parstof
~~~~~~~~~~~~~~
spectral/hartel/parstof ends up saying
case (unpackCString "x") of { c:cs -> ... }
quite a bit. We should spot these and behave accordingly.
power
~~~~~
With GHC 4.08, for some reason the arithmetic defaults to Double. The
right thing is to default to Rational, which accounts for the big increase
in runtime after 4.08.
puzzle
~~~~~~
The main function is 'transfer'. It has some complicated join points, and
a big issue is that full laziness can float out many small MFEs that then
make much bigger closures. It's quite delicate: small changes can make
big differences, and I spent far too long gazing at it.
I found that in my experimental proto 4.09 compiler I had
let ds = go xs in
let $j = .... ds ... in
case e of
  p1 -> $j
  p2 -> $j
  p3 -> ds
But ds wasn't getting inlined because we couldn't spot that it was actually
only being used once. Keith's analyser should find this!
Also, making concat into a good producer made a large gain.
My proto 4.09 still allocates more, partly because of more full laziness relative
to 4.08; I don't know why that happens.
Extra allocation is happening in 5.02 as well; perhaps for the same reasons. There is
at least one instance of floating that prevents fusion; namely the enumerated lists
in 'transfer'.
sphere
~~~~~~
A key function is vecsub, which looks like this (after w/w)
$wvecsub
= \ ww :: Double ww1 :: Double ww2 :: Double
ww3 :: Double ww4 :: Double ww5 :: Double ->
let { a = case ww of wild { D# x ->
case ww3 of wild1 { D# y ->
let { a1 = -## x y
} in $wD# a1
} } } in
let { a1 = case ww1 of wild { D# x ->
case ww4 of wild1 { D# y ->
let { a2 = -## x y
} in $wD# a2
} } } in
let { a2 = case ww2 of wild { D# x ->
case ww5 of wild1 { D# y ->
let { a3 = -## x y
} in $wD# a3
} }
} in (# a, a1, a2 #)
Currently it gets guidance: IF_ARGS 6 [2 2 2 2 2 2] 25 4
It occurs in a context like
case $wvecsub a b c d e f of (# p, q, r #) ->
case p of D# p' ->
case q of D# q' ->
case r of D# r' ->
So it would be much, much better to inline it. But the result-discounting
doesn't "see" that the three results are each taken apart, and that's the
problem.
One question is this: why did minusDouble get inlined? If it didn't then
$wvecsub would look much smaller:
$wvecsub
= \ ww :: Double ww1 :: Double ww2 :: Double
ww3 :: Double ww4 :: Double ww5 :: Double ->
let { a = minusDouble ww ww3 } in
let { a1 = minusDouble ww1 ww4 } in
let { a2 = minusDouble ww2 ww5 } in
(# a, a1, a2 #)
Reason minusDouble gets inlined: because we are trying to get update in place.
So I added a flag -funfolding-update-in-place to enable this "optimisation".
Omitting the flag gives much better inlining for $wvecsub at least.
Sphere also does 60,000 calls to hPutStr, so I/O plays a major role. Currently
this I/O does a *lot* of allocation, much of it since the addition of thread-safety.
treejoin
~~~~~~~~
Does a lot of IO.readFile. In GHC.IO.Encoding.UTF8 the demand
analyser sees a strict function with type
a_s1gj :: GHC.IO.Buffer.Buffer GHC.Word.Word8
-> GHC.IO.Buffer.Buffer GHC.Types.Char
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld,
(GHC.IO.Encoding.Types.CodingProgress,
GHC.IO.Buffer.Buffer GHC.Word.Word8,
GHC.IO.Buffer.Buffer GHC.Types.Char) #)
Unboxing both Buffer arguments makes a HUGE difference (halves
allocation); but that makes the worker function have 12 arguments. A
good reason for unboxing even if the worker gets a lot of args.
sorting
~~~~~~~
Same issue with GHC.IO.Encoding.UTF8 as treejoin
---------------------------------------
Real suite
---------------------------------------
gg
~~
Same issue with GHC.IO.Encoding.UTF8 as treejoin
Maillist
~~~~~~~~
Uses appendFile repeatedly rather than opening the output file once,
which leads to numerous file opens/closes. Allocations will rise with
the new I/O subsystem in 5.02 because the I/O buffer will be
re-allocated on the heap for each open, whereas previously it would be
allocated on the C heap and therefore not show up in the stats.
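The contrast, sketched with hypothetical helpers (writeChunksSlow mirrors what Maillist does; writeChunksFast opens the file, and hence allocates its buffer, only once):

```haskell
import System.IO

-- One open/close (and, in 5.02, one freshly heap-allocated I/O
-- buffer) per chunk written.
writeChunksSlow :: FilePath -> [String] -> IO ()
writeChunksSlow path = mapM_ (appendFile path)

-- Open once, write many: a single handle and a single buffer.
writeChunksFast :: FilePath -> [String] -> IO ()
writeChunksFast path chunks =
  withFile path AppendMode (\h -> mapM_ (hPutStr h) chunks)

main :: IO ()
main = do
  let out = "maillist-demo.txt"
  writeChunksFast out ["hello, ", "world\n"]
  readFile out >>= putStr
```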
---------------------------------------
Shootout suite
---------------------------------------
binary-tree
~~~~~~~~~~~
In May 2016, a series of seemingly unrelated commits changed the runtime
performance of this up and down by ~3%. Maybe a performance cliff. Mailing
list thread:
https://mail.haskell.org/pipermail/ghc-devs/2017-March/013886.html