Reorganize Meta #41

Open
tgross35 opened this issue Dec 31, 2022 · 3 comments


@tgross35
Contributor

tgross35 commented Dec 31, 2022

Currently we have something like the following. Minimum sizes assume empty storage; averages are a best guess for the common case:

// We count the data behind Arcs as free because we have to store that information anyway

// Ignoring string: 24 local, 56ish remote
// String: 24 local, about 64 remote
type WordList = HashMap<String, Vec<Meta>>;

// 40 local
pub struct Meta {
    stem: Arc<str>, // 16 local
    source: Source,  // 24 local
}

// 24 local
pub enum Source {
    Affix(Arc<AfxRule>, usize), // 16, 32 pointee
    Dict(Box<[Arc<MorphInfo>]>), // 16 local, 24 pointee
    Personal(Box<PersonalMeta>), // 8 local, 40 pointee
    Raw,
}

// 40 local, extra meta in personal is uncommon
pub struct PersonalMeta { 
    friend: Option<Arc<str>>, // 16 local
    morph: Vec<Arc<MorphInfo>>, // 24 local
}

// 24 local, ~8 pointee
pub enum MorphInfo {
    Stem(MorphStr), /* ... */
}

// 32 local
pub struct AfxRule {
    kind: RuleType,
    can_combine: bool,
    patterns: Vec<AfxRulePattern>,
}

// 88 local
pub struct AfxRulePattern {
    affix: Box<str>,
    condition: Option<ReWrapper>,
    strip: Option<Arc<str>>,
    morph_info: Vec<Arc<MorphInfo>>,
}
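
The local sizes annotated above can be sanity-checked with `std::mem::size_of`. The following is only a sketch: `MorphInfo` and `AfxRule` are replaced by empty stand-ins (an assumption made here, since they only appear behind pointers and so should not affect the measured sizes), and the numbers depend on the target and compiler, so it prints them rather than hard-coding expectations.

```rust
use std::mem::size_of;
use std::sync::Arc;

// Opaque stand-ins (assumption): these types only sit behind Arc/Box here,
// so their real contents do not change the sizes measured below.
#[allow(dead_code)]
pub struct MorphInfo;
#[allow(dead_code)]
pub struct AfxRule;

#[allow(dead_code)]
pub struct PersonalMeta {
    friend: Option<Arc<str>>,
    morph: Vec<Arc<MorphInfo>>,
}

#[allow(dead_code)]
pub enum Source {
    Affix(Arc<AfxRule>, usize),
    Dict(Box<[Arc<MorphInfo>]>),
    Personal(Box<PersonalMeta>),
    Raw,
}

#[allow(dead_code)]
pub struct Meta {
    stem: Arc<str>,
    source: Source,
}

fn main() {
    // Print the "local" size of each type, i.e. what one value occupies inline.
    println!("Arc<str>:     {}", size_of::<Arc<str>>());
    println!("Source:       {}", size_of::<Source>());
    println!("Meta:         {}", size_of::<Meta>());
    println!("PersonalMeta: {}", size_of::<PersonalMeta>());
    println!("Vec<Meta>:    {}", size_of::<Vec<Meta>>());
}
```

This only covers inline sizes; the "remote" and "pointee" figures depend on what is actually allocated at runtime.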

That's really not terrible at ~80 bytes of meta per entry, but I think we can simplify things, even setting the storage reasons aside.

// Ignoring string: 24 local, 32ish remote
// String: 24 local, about 64 remote
type WordList = HashMap<String, Vec<Meta>>;

// 16 local, 16 remote max
struct Meta(MetaInner);

// 16 local
enum MetaInner {
    DictStem(Arc<str>),
    DictMorph(Arc<MorphInfo>),
    PersonalStem(Arc<str>),
    PersonalFriend(Arc<str>),
    AfxRule(Box<AfxMeta>),
    Raw,
}

// 16 local
struct AfxMeta {
    rule: Arc<AfxRule>,
    pat_idx: usize
}

This would mean more entries in a single vector rather than multiple entries in multiple vectors, and that's probably a good thing for various reasons. Having a flat structure rather than nested will probably make the CPU a bit happier too.
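
To make the flat layout concrete, here is a minimal sketch of reading from a single `Vec<Meta>` per word. The `stems` and `sample_list` helpers and the empty `MorphInfo`/`AfxRule` stand-ins are hypothetical, added only for illustration. It also prints `size_of::<MetaInner>()`, which seems worth checking: with several variants that each hold a 16-byte `Arc<str>`, the compiler may need room for a separate tag, so the enum could come out larger than the 16 bytes estimated above.

```rust
use std::collections::HashMap;
use std::mem::size_of;
use std::sync::Arc;

// Opaque stand-ins (assumption) for types defined elsewhere.
#[allow(dead_code)]
struct MorphInfo;
#[allow(dead_code)]
struct AfxRule;

#[allow(dead_code)]
struct AfxMeta {
    rule: Arc<AfxRule>,
    pat_idx: usize,
}

#[allow(dead_code)]
enum MetaInner {
    DictStem(Arc<str>),
    DictMorph(Arc<MorphInfo>),
    PersonalStem(Arc<str>),
    PersonalFriend(Arc<str>),
    AfxRule(Box<AfxMeta>),
    Raw,
}

struct Meta(MetaInner);

type WordList = HashMap<String, Vec<Meta>>;

/// Hypothetical helper: collect every stem recorded for `word`
/// with one pass over its flat metadata vector.
fn stems<'a>(list: &'a WordList, word: &str) -> Vec<&'a str> {
    list.get(word)
        .into_iter()
        .flatten()
        .filter_map(|meta| match &meta.0 {
            MetaInner::DictStem(s) | MetaInner::PersonalStem(s) => Some(&**s),
            _ => None,
        })
        .collect()
}

/// Hypothetical sample data: one word with three flat metadata entries.
fn sample_list() -> WordList {
    let mut list = WordList::new();
    list.insert(
        "walked".to_owned(),
        vec![
            Meta(MetaInner::DictStem(Arc::from("walk"))),
            Meta(MetaInner::DictMorph(Arc::new(MorphInfo))),
            Meta(MetaInner::AfxRule(Box::new(AfxMeta {
                rule: Arc::new(AfxRule),
                pat_idx: 0,
            }))),
        ],
    );
    list
}

fn main() {
    let list = sample_list();
    println!("stems for 'walked': {:?}", stems(&list, "walked"));
    println!("MetaInner: {} bytes", size_of::<MetaInner>());
}
```

Each kind of information becomes its own entry in the one vector, so a lookup is a single linear scan with a `match` rather than a walk through nested containers.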

I would like to run this all through valgrind before actually making the change, to get a good idea of how much we save.
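
As a rough in-process complement to valgrind (not a replacement for it), heap usage can also be estimated with a counting global allocator. This is a sketch under assumptions: `CountingAlloc`, `LIVE_BYTES`, and `heap_growth` are names invented here, and the figure only counts bytes requested from the allocator, ignoring allocator overhead and fragmentation.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Bytes currently live on the heap, as requested by the program.
static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper around the system allocator that tracks live bytes.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Run `build` and report how many heap bytes are still live because of it.
fn heap_growth<T>(build: impl FnOnce() -> T) -> (T, usize) {
    let before = LIVE_BYTES.load(Ordering::Relaxed);
    let value = build();
    let after = LIVE_BYTES.load(Ordering::Relaxed);
    (value, after.saturating_sub(before))
}

fn main() {
    // Stand-in workload: 1000 shared stems, similar in spirit to Meta entries.
    let (stems, bytes) = heap_growth(|| {
        (0..1000)
            .map(|i| Arc::<str>::from(format!("stem{i}")))
            .collect::<Vec<Arc<str>>>()
    });
    println!("{} stems keep {} heap bytes alive", stems.len(), bytes);
}
```

Building the same word list once with the current types and once with the proposed ones inside `heap_growth` would give a before/after comparison, with valgrind still being the more faithful measurement.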

@tgross35
Contributor Author

Related: do we want to store the flags when generating words? This might help us with easier compounding.

@tgross35
Contributor Author

Actually we probably want a few separate MetaInner fields that would point to whatever they represent

@tgross35
Contributor Author

Why did I never consider Arc<str> instead of Arc<String>... that would probably mean some great savings
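
For reference, a small sketch of the difference. `Arc<String>` is a thin pointer to a shared `String`, which in turn owns a second heap buffer; `Arc<str>` is a fat pointer (address plus length) to a single allocation holding the reference counts followed by the bytes. So the handle itself is wider, but one allocation and the `String` header go away. The `share` helper is a hypothetical name used only here.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical helper: turn an owned String into a shared, immutable str.
fn share(s: String) -> Arc<str> {
    Arc::from(s)
}

fn main() {
    // Two allocations: the Arc's block (counts + String header) and the text buffer.
    let nested: Arc<String> = Arc::new(String::from("walk"));

    // One allocation: counts and text bytes stored together.
    let flat: Arc<str> = share(String::from("walk"));

    // Both deref to the same text.
    println!("{} / {}", nested.as_str(), &*flat);

    // Handle sizes: thin pointer vs. pointer plus length.
    println!("Arc<String>: {} bytes", size_of::<Arc<String>>());
    println!("Arc<str>:    {} bytes", size_of::<Arc<str>>());
    println!("String:      {} bytes", size_of::<String>());
}
```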
