
Can Kimi replace Claude in our AI work?

Description

Claude is relatively expensive for the team to keep running. It's worth looking into other options that deliver good service while being cost-efficient.

This issue tracks experiments with Kimi as a potential replacement for Claude in GN work.

Objectives

  • Compare generation quality between Kimi and Claude models in the GN agentic system
  • Weigh the options between Kimi and Claude for GN code generation

Tasks

  • [X] Set up Kimi with DSPy
  • [X] Experiment with Kimi models
  • [X] Compare AI search feedback between Kimi and Claude models
  • [X] Set up pi and kimi code
  • [X] Compare code generation capabilities of Kimi and Claude

1. Setup Kimi with DSPy

Kimi is relatively recent compared to Claude and other large models.

It can be used through an OpenAI-compatible interface in DSPy, but it's preferable to look for direct support. I read that LiteLLM offers direct support.

So I was able to write:

kimi_lm = dspy.LM(model="moonshot/kimi-k2-0711-preview", api_key="XXXX")

dspy.configure(lm=kimi_lm)

Testing it with:

qa = dspy.ChainOfThought("question -> answer: str")

answer = qa(question="What is DSPy?")

shows that everything is working fine.

To use it in gnais, I just need to change the environment variables `MODEL_NAME` and `API_KEY`.

That worked too!
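A minimal sketch of what this env-driven model selection could look like (hypothetical helper, not the actual gnais code; the variable names `MODEL_NAME` and `API_KEY` are the ones mentioned above, the default values are illustrative):

```python
import os

def build_lm_config(env=os.environ):
    """Read the model settings from the environment, so that swapping
    providers (Claude -> Kimi) is just a matter of exporting two variables."""
    return {
        "model": env.get("MODEL_NAME", "moonshot/kimi-k2-0711-preview"),
        "api_key": env.get("API_KEY", ""),
    }

# The resulting dict can then be splatted into the LM constructor,
# e.g. dspy.LM(**build_lm_config()).
```

With this shape, no code change is needed to switch models — only the environment changes.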

2. Experiment with kimi models

Unlike Claude, Kimi models are still under active development, and the list of official model names is not as well known.

Looking into different options through search revealed that I can use:

  • (a) kimi-k2-0711-preview
  • (b) kimi-k2-0905-preview
  • (c) kimi-k2-thinking
  • (d) kimi-k2.5

I tried them one by one.

Options (a), (b), and (d) were successful. Option (c) was inconclusive: it ran for more than 20 minutes, where the others usually finish much faster, so I skipped it.
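One way to avoid a slow model (like option (c) here) stalling the whole experiment is to give each run a wall-clock budget. A hedged sketch — `call` stands in for whatever inference function is being tested, and the budget value is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def call_with_budget(call, timeout_s, *args, **kwargs):
    """Run `call` with a time budget; return None (i.e. 'skip this model')
    if it does not finish in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        return None  # treat the run as skipped
    finally:
        # Don't block on the straggler; drop any queued work.
        pool.shutdown(wait=False, cancel_futures=True)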

3. Compare AI search feedback between Kimi and Claude models

Having selected three Kimi models to proceed with:

  • (a) kimi-k2-0711-preview
  • (b) kimi-k2-0905-preview
  • (c) kimi-k2.5

I compared their performance against:

  • (a) Claude Haiku 4.5
  • (b) Claude Sonnet 4
  • (c) Claude Sonnet 4.5

I used GNAIS's hybrid search with the following query for the comparison:

QUERY: Digest BXD datasets related to cognition

kimi-k2-0711-preview

Feedback:

{
  "BXD cognition datasets": {
    "dataset-level resources": [
      {
        "name": "BXD Aged Hippocampus",
        "focus": "hippocampal gene-expression & phenotypes in aged mice",
        "relevance": "learning & memory",
        "url": "https://rdf.genenetwork.org/v1/id/set_B6D2RI"
      },
      {
        "name": "BXD JAX Open Field Study",
        "focus": "open-field behavioral assays",
        "relevance": "exploration, anxiety, locomotion",
        "url": "https://rdf.genenetwork.org/v1/id/set_BXD_JAX_OFS"
      },
      {
        "name": "BXD UTHSC MBD Hippocampus",
        "focus": "hippocampal molecular & behavioral data",
        "relevance": "cognition research",
        "url": "https://rdf.genenetwork.org/v1/id/set_BXD_MBD_UTHSC"
      },
      {
        "name": "BXD JAX AD Study",
        "focus": "Alzheimer’s-disease model phenotypes & molecular data",
        "relevance": "cognitive decline & memory impairment",
        "url": "https://rdf.genenetwork.org/v1/id/set_BXD_JAX_AD"
      }
    ],
    "individual cognitive traits": [
      {
        "trait_id": "BXD_13369",
        "task": "5-choice serial reaction time task (5CSRT)",
        "measure": "motivation (completed trials in 25 min)",
        "interpretation": "low value = low motivation",
        "url": "https://cd.genenetwork.org/show_trait?trait_id=13369&dataset=BXDPublish"
      },
      {
        "trait_id": "BXD_13370",
        "task": "5CSRT",
        "measure": "attention accuracy (variable stimulus duration)",
        "interpretation": "high value = good performance",
        "url": "https://cd.genenetwork.org/show_trait?trait_id=13370&dataset=BXDPublish"
      },
      {
        "trait_id": "BXD_13371",
        "task": "5CSRT",
        "measure": "attention (errors of omission)",
        "interpretation": "low value = good performance",
        "url": "https://cd.genenetwork.org/show_trait?trait_id=13371&dataset=BXDPublish"
      },
      {
        "trait_id": "BXD_13364",
        "task": "5CSRT",
        "measure": "attention/motivation (errors of omission)",
        "interpretation": "low value = good performance",
        "url": "https://cd.genenetwork.org/show_trait?trait_id=13364&dataset=BXDPublish"
      },
      {
        "trait_id": "CCB_10022",
        "task": "fear conditioning",
        "measure": "fear acquisition (% freezing difference 1st vs 5th tone)",
        "interpretation": "fear-learning metric",
        "url": "https://cd.genenetwork.org/show_trait?trait_id=10022&dataset=CCBXDPublish"
      },
      {
        "trait_id": "CCB_10023",
        "task": "fear conditioning",
        "measure": "fear extinction (difference vs last 3 tones)",
        "interpretation": "fear-memory-extinction metric",
        "url": "https://cd.genenetwork.org/show_trait?trait_id=10023&dataset=CCBXDPublish"
      },
      {
        "trait_id": "CCB_10024",
        "task": "fear conditioning",
        "measure": "fear expression (% freezing to first 3 tones in novel context)",
        "interpretation": "fear-memory metric",
        "url": "https://cd.genenetwork.org/show_trait?trait_id=10024&dataset=CCBXDPublish"
      },
      {
        "trait_id": "CCB_10026",
        "task": "fear conditioning",
        "measure": "habituation to chamber (% freezing during 30-min habituation)",
        "interpretation": "habituation metric",
        "url": "https://cd.genenetwork.org/show_trait?trait_id=10026&dataset=CCBXDPublish"
      }
    ]
  },
  "usage_notes": "Explore the four dataset-level resources on GeneNetwork RDF for broad hippocampal, behavioral, and AD-related data. For specific cognitive phenotypes, use the eight individual trait links on the GeneNetwork CD portal."
}

I liked how clear, relevant and concise it was.

kimi-k2-0905-preview

Feedback:

{
  "cognition_datasets": {
    "alzheimer_and_neurodegeneration": {
      "dataset": "BXD NIA Alzheimer’s Studies",
      "description": "Comprehensive cognitive-decline & Alzheimer’s data with behavioral testing",
      "link": "https://cd.genenetwork.org/show_trait?trait_id=BXD_NIA_AD&dataset=BXD_NIA_AD"
    },
    "midbrain_transcriptomics": {
      "dataset": "VUBXDMouseMidBrainQ0512",
      "description": "Midbrain gene-expression relevant to cognitive function",
      "link": "https://cd.genenetwork.org/show_trait?trait_id=VUBXDMouseMidBrainQ0512&dataset=VUBXDMouseMidBrainQ0512"
    },
    "large_scale_brain_phenome": {
      "dataset": "CC BXD Tabbaa Brain n=1041 (Jan23)",
      "description": "1041-strain brain phenome with neurological & cognitive measures",
      "link": "https://cd.genenetwork.org/show_trait?trait_id=CC_BXD_TM&dataset=CC_BXD_TM"
    },
    "attention_and_motivation_5CSRT": {
      "traits": [13369, 13370, 13371, 13364],
      "description": "5-choice serial reaction-time task: completed trials, accuracy, omission errors",
      "links": [
        "https://cd.genenetwork.org/show_trait?trait_id=13369&dataset=BXDPublish",
        "https://cd.genenetwork.org/show_trait?trait_id=13370&dataset=BXDPublish",
        "https://cd.genenetwork.org/show_trait?trait_id=13371&dataset=BXDPublish",
        "https://cd.genenetwork.org/show_trait?trait_id=13364&dataset=BXDPublish"
      ]
    },
    "fear_learning_and_memory": {
      "traits": [10022, 10023, 10024, 10026],
      "panel": "CCBXD_TM",
      "description": "Fear acquisition, extinction, expression and habituation (% freezing)",
      "links": [
        "https://cd.genenetwork.org/show_trait?trait_id=10022&dataset=CCBXD_TM",
        "https://cd.genenetwork.org/show_trait?trait_id=10023&dataset=CCBXD_TM",
        "https://cd.genenetwork.org/show_trait?trait_id=10024&dataset=CCBXD_TM",
        "https://cd.genenetwork.org/show_trait?trait_id=10026&dataset=CCBXD_TM"
]
    }
  },
  "summary": "The BXD family provides multi-level cognition data: Alzheimer-focused phenotypes, midbrain transcriptomics, large-scale brain phenomes, plus standardized behavioral assays (5CSRT attention/motivation and fear-learning paradigms). All datasets are public on GeneNetwork with integrated QTL mapping tools."                                                   
}

This is quite similar to the previous one. I prefer the previous one because it maintains data integrity.

kimi-k2.5

This model showed promise because the agent discovered many more documents than the previous ones did.

Agent output:

"final_answer": "The BXDPublish dataset contains extensive cognition-related phenotypes including 6 traits for working memory deficit onset (AAO_ADBXD series), 9 social interaction measures (3C series for social approach/novelty), ethanol drinking behavior traits (DID method), and pharmacological response behaviors (nicotine, morphine, ethanol locomotor activity). These traits cover cognitive domains including learning and memory, social cognition, anxiety-like behavior, and drug-related behavioral responses. All traits are accessible via the GeneNetwork Classic Display (CD) interface using the provided links with format: https://cd.genenetwork.org/show_trait?trait_id={ID}&dataset=BXDPublish."

Unfortunately, generation is unstable: the final feedback was not formatted as JSON despite the instructions.

In fact, one cannot set `temperature=0` because the temperature has been constrained by the developer. In short, the output may differ between runs.
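When a model sometimes wraps the requested JSON in prose (or drops it entirely, as happened here), a defensive parser helps. A minimal sketch, assuming the feedback arrives as a single string — `parse_feedback` is a hypothetical helper, not part of GNAIS:

```python
import json
import re

def parse_feedback(text):
    """Try strict JSON first; fall back to the first {...} block embedded
    in surrounding prose; return None if no JSON can be recovered."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return None  # caller can then treat the output as free text
```

A `None` result would flag runs like this kimi-k2.5 one, where the output has to be read as free text instead.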

Final feedback:

The BXD and BXDPublish datasets contain extensive cognition-related phenotypes spanning multiple behavioral domains:

**Working Memory and Spatial Cognition:**
- Y-maze spontaneous alternation measures tracked across ages 6, 8, 10, 14, and 16 months (traits 20577, 20567, 20572, 20592 for combined sexes; 20905, 20915, 20920 for males), assessing spatial working memory performance
- Age at onset of working memory deficits (AAO series): traits 20473, 20637, 20720 (AAO_ADBXD series for Alzheimer's disease models) and 20556, 20803, 20884 (AAO_NtgBXD series for non-transgenic littermates), measuring when cognitive performance drops below 50% chance
- Y-maze error indices quantifying unsuccessful spontaneous alternations (traits 20570, 20575, 20595 for both sexes; 20923 for males)

**Attention and Impulsivity:**
- 5-Choice Serial Reaction Time Task (5CSRT) measures: completed trials as motivation indicator (13369), response accuracy with variable stimulus duration as attention measure (13370), and errors of omission as attention/motivation measures (13364, 13371)

**Visual Discrimination and Learning:**
- Touchscreen-based visual discrimination assays: mean reaction time (16210), sessions required to discriminate rewarded versus unrewarded stimuli (16204), and extinction learning of instrumental responses (16223)
- Spontaneous exploratory activity measured via magazine entries in touchscreen chambers (17938)

**Social Cognition:**
- 3-chamber social interaction paradigm: social approach preference (3C_SA_ChambPref 19306, 3C_SA_SniffPref 19307), social novelty preference (3C_SN_ChambPref 19308, 3C_SN_SniffPref 19309), and habituation behaviors including center latency and entries (19311, 19312)

**Drug Response and Ethanol Behaviors:**
- Drinking in the Dark (DID) method: 20% alcohol intake across weeks 1-5 (20010-20014) and difference scores measuring binge-like drinking patterns (20285, 20190, 20323, 20228, 20348, 20253)
- Drug-induced locomotor activity: responses to nicotine (0_5NicPostCon 14805, 0_5NicPref 14803), morphine (AMCNT series 11826-11834), and ethanol (ACTI series 11954-11966)

**Access Information:**
All traits are accessible via GeneNetwork Classic Display using URLs formatted as:
- https://cd.genenetwork.org/show_trait?trait_id={ID}&dataset=BXD
- https://cd.genenetwork.org/show_trait?trait_id={ID}&dataset=BXDPublish

**Data Sources:**
These phenotypes derive from studies referenced in PubMed IDs 22098762 and 24586288, as well as unpublished datasets, providing comprehensive coverage of cognitive phenotypes in the BXD recombinant inbred population.

Claude Haiku 4.5

Feedback:

{
  "query": "Digest BXD datasets related to cognition",
  "summary": "A comprehensive analysis of BXD (BXD recombinant inbred lines) datasets containing cognition-related traits was conducted. The investigation revealed multiple cognition-related phenotypes available in the GeneNetwork database.",
  "key_findings": {
    "dataset_availability": "BXD datasets are confirmed to exist in the GeneNetwork RDF database and web interface",
    "cognition_traits_identified": 9,
    "trait_categories": [
      {
        "category": "5-Choice Serial Reaction Time Task (5CSRT)",
        "description": "Measures of attention, motivation, and motor function",
        "traits": [
          {
            "id": "BXD_13369",
            "measure": "Number of completed trials in 25-minute session",
            "mean": 56.35,
            "url": "https://cd.genenetwork.org/show_trait?trait_id=13369&dataset=BXDPublish"
          },
          {
            "id": "BXD_13368",
            "measure": "Mean latency to retrieve food pellet (high value/low motivation)",
            "mean": "2.72 seconds",
            "url": "https://cd.genenetwork.org/show_trait?trait_id=13368&dataset=BXDPublish"
          },
          {
            "id": "BXD_13365",
            "measure": "Mean correct response latency (attention/motivation/motor)",
            "mean": "0.98 seconds",
            "url": "https://cd.genenetwork.org/show_trait?trait_id=13365&dataset=BXDPublish"
          },
          {
            "id": "BXD_13370",
            "measure": "Response accuracy with variable stimulus duration (attention)",
            "mean": "75.21%",
            "url": "https://cd.genenetwork.org/show_trait?trait_id=13370&dataset=BXDPublish"
          },
          {
            "id": "BXD_13363",
            "measure": "Response accuracy (attention)",
            "mean": "84.19%",
            "url": "https://cd.genenetwork.org/show_trait?trait_id=13363&dataset=BXDPublish"
          }
        ]
      },
      {
        "category": "Morris Water Maze",
        "description": "Learning and memory function assessment",
        "traits": [
          {
            "id": "BXD_10345",
            "measure": "Swim time for probe trial on day 5",
            "mean": "31.45 seconds",
            "url": "https://cd.genenetwork.org/show_trait?trait_id=10345&dataset=BXDPublish"
          }
        ]
      },
      {
        "category": "Touchscreen Assays",
        "description": "Visual discrimination and extinction learning",
        "traits": [
          {
            "id": "BXD_16210",
            "measure": "Mean reaction time to nose-poke stimulus (visual discrimination)",
            "mean": "8.49 seconds",
            "url": "https://cd.genenetwork.org/show_trait?trait_id=16210&dataset=BXDPublish"
          },
          {
            "id": "BXD_16223",
            "measure": "Correct responses during extinction learning",
            "mean": 202.98,
            "url": "https://cd.genenetwork.org/show_trait?trait_id=16223&dataset=BXDPublish"
          }
        ]
      },
      {
        "category": "Spontaneous Activity",
        "description": "Behavioral activity measures",
        "traits": [
          {
            "id": "BXD_17938",
            "measure": "Magazine entries in 30 minutes (2.5-3.5 month old females)",
            "mean": "61.00 entries",
            "url": "https://cd.genenetwork.org/show_trait?trait_id=17938&dataset=BXDPublish"
          }
        ]
      }
    ]
  },
  "access_information": {
    "primary_resource": "GeneNetwork (https://genenetwork.org)",
    "database": "BXDPublish dataset",
    "note": "Initial SPARQL queries on the RDF endpoint had limitations in retrieving cognition-related traits, but the web interface provides direct access to all identified traits"
  },
  "conclusion": "The BXD datasets contain a rich collection of cognition-related phenotypes spanning multiple cognitive domains including attention, motivation, learning, memory, and behavioral activity. These traits are suitable for genetic mapping and systems genetics studies to identify genetic variants influencing cognitive function."
}

This shares some similarities with the feedback from kimi-k2. However, the feedback from Claude Haiku seems more elaborate than the one from kimi-k2.

Claude Sonnet 4

Feedback:

{
  "overview": "BXD datasets provide a comprehensive collection of cognition-related phenotypes spanning behavioral, molecular, and disease-focused studies across multiple cognitive domains.",
  
  "primary_datasets": {
    "BXDPublish": {
      "description": "Contains numerous cognitive and behavioral phenotypes including anxiety assays, motor learning tasks, and maze-based tests",
      "key_traits": [
        {
          "trait_id": "18794",
          "name": "Activity_Anxiety_Eigen",
          "url": "https://cd.genenetwork.org/show_trait?trait_id=18794&dataset=BXDPublish"
        },
        {
          "trait_id": "21179", 
          "name": "Complex_motor_learning",
          "url": "https://cd.genenetwork.org/show_trait?trait_id=21179&dataset=BXDPublish"
        },
        {
          "trait_id": "21249",
          "name": "General_Motor_learning", 
          "url": "https://cd.genenetwork.org/show_trait?trait_id=21249&dataset=BXDPublish"
        },
        {
          "trait_id": "10964",
          "name": "Anxiety_LightSide",
          "url": "https://cd.genenetwork.org/show_trait?trait_id=10964&dataset=BXDPublish"
        },
        {
          "trait_id": "11012",
          "name": "PlusMazeOpen",
          "url": "https://cd.genenetwork.org/show_trait?trait_id=11012&dataset=BXDPublish"
        },
        {
          "trait_id": "11013",
          "name": "PlusMazeClosedIn",
          "url": "https://cd.genenetwork.org/show_trait?trait_id=11013&dataset=BXDPublish"
        }
      ]
    },
    
    "alzheimers_disease_datasets": {
      "BXD_NIA_AD": {
        "description": "Focuses on Alzheimer's disease research using crosses between 5XFAD females and BXD males to produce F1 transgenic carriers",
        "features": "Behavioral and physical phenotypes tested at 6 and 14 months of age for cognitive decline and resilience studies",
        "url": "http://rdf.genenetwork.org/v1/id/dataset_BXD_NIA_AD"
      },
      "JAX_BXD_AD_GSE119408_1124": {
        "description": "Studies normal cognitive aging and Alzheimer's disease effects using Cg.5XFAD females bred to BXD males",
        "features": "F1 progeny monitored throughout lifespan to evaluate genetic background effects on cognitive and pathological traits",
        "url": "http://rdf.genenetwork.org/v1/id/dataset_JAX_BXD_AD_GSE119408_1124"
      }
    },
    
    "brain_region_datasets": {
      "DOD_BXD_PFC_CD_RNA_Seq_1019": {
        "region": "Prefrontal cortex",
        "type": "RNA-seq",
        "url": "http://rdf.genenetwork.org/v1/id/dataset_DOD_BXD_PFC_CD_RNA_Seq_1019"
      },
      "NIAAA_BXD_Hip_CMS_RNAseq1020": {
        "region": "Hippocampus", 
        "type": "RNA-seq",
        "url": "http://rdf.genenetwork.org/v1/id/dataset_NIAAA_BXD_Hip_CMS_RNAseq1020"
      },
      "UTHSC_BXD_AgeHipp0515": {
        "region": "Aged hippocampus",
        "type": "Expression data",
        "url": "http://rdf.genenetwork.org/v1/id/dataset_UTHSC_BXD_AgeHipp0515"
      },
      "JAX_BXD_Hip_Pro": {
        "region": "Hippocampus",
        "type": "Proteomics",
        "datasets": [
          "http://rdf.genenetwork.org/v1/id/dataset_JAX_BXD_Hip_Pro_0219",
          "http://rdf.genenetwork.org/v1/id/dataset_JAX_BXD_Hip_Pro_0723"
        ]
      }
    }
  },
  
  "cognitive_domains": {
    "attention_executive_function": {
      "description": "5-choice serial reaction time task (5CSRT) measures using Med Associates 5-hole operant chambers",
      "traits": [
        {
          "trait_id": "BXD_13369",
          "measure": "Completed trials",
          "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_13369"
        },
        {
          "trait_id": "BXD_13370", 
          "measure": "Response accuracy with variable stimulus duration",
          "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_13370"
        },
        {
          "trait_id": "BXD_13364",
          "measure": "Errors of omission",
          "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_13364"
        },
        {
          "trait_id": "BXD_13371",
          "measure": "Errors of omission (additional)",
          "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_13371"
        }
      ]
    },
    
    "learning_memory": {
      "y_maze_tests": {
        "description": "Y-maze spontaneous alternation performance measuring working memory and spatial cognition",
        "age_groups": ["6 months", "8 months", "10 months", "14 months", "16 months"],
        "sample_traits": [
          {
            "trait_id": "BXD_20577",
            "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_20577"
          },
          {
            "trait_id": "BXD_20567", 
            "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_20567"
          },
          {
            "trait_id": "BXD_20572",
            "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_20572"
          }
        ]
      },
      
      "visual_discrimination": {
        "description": "Touchscreen assays for visual learning",
        "traits": [
          {
            "trait_id": "BXD_16204",
            "measure": "Sessions to discriminate between rewarded and unrewarded stimuli",
            "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_16204"
          },
          {
            "trait_id": "BXD_16210",
            "measure": "Reaction time measurements", 
            "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_16210"
          },
          {
            "trait_id": "BXD_16223",
            "measure": "Extinction learning",
            "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_16223"
          }
        ]
      }
    },
    
    "cognitive_aging": {
      "description": "Age-related cognitive decline measures",
      "traits": [
        {
          "trait_id": "BXD_20556",
          "measure": "Age at onset of working memory deficits",
          "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_20556"
        },
        {
          "trait_id": "BXD_20803",
          "measure": "Age at onset of working memory deficits (female-specific)",
          "url": "http://rdf.genenetwork.org/v1/id/trait_BXD_20803"
        }
      ]
    },
    
    "emotional_learning": {
      "description": "Fear conditioning paradigms in CC-BXD datasets",
      "traits": [
        {
          "trait_id": "CCB_10022",
          "measure": "Fear acquisition",
          "url": "http://rdf.genenetwork.org/v1/id/trait_CCB_10022"
        },
        {
          "trait_id": "CCB_10023",
          "measure": "Fear extinction",
          "url": "http://rdf.genenetwork.org/v1/id/trait_CCB_10023"
        },
        {
          "trait_id": "CCB_10024", 
          "measure": "Fear expression",
          "url": "http://rdf.genenetwork.org/v1/id/trait_CCB_10024"
        },
        {
          "trait_id": "CCB_10026",
          "measure": "Habituation to testing apparatus",
          "url": "http://rdf.genenetwork.org/v1/id/trait_CCB_10026"
        }
      ]
    }
  },
  
  "specialized_studies": {
    "neurotransmitter_systems": {
      "VUBXDMouseMidBrainQ0512": {
        "focus": "Serotonin (5-HT) homeostasis across BXD strains",
        "measures": "5-HT and 5-HIAA levels in midbrain and other regions",
        "url": "http://rdf.genenetwork.org/v1/id/dataset_VUBXDMouseMidBrainQ0512"
      }
    },
    
    "aging_pathology": {
      "JAX_BXD_Hip_Pro_0723": {
        "focus": "Polyglucosan bodies (PGBs) in aging hippocampus",
        "relevance": "Genetic modifiers affecting PGB burden and links to learning and memory through Pgb1 locus",
        "url": "http://rdf.genenetwork.org/v1/id/dataset_JAX_BXD_Hip_Pro_0723"
      }
    },
    
    "human_cognition": {
      "HCP_datasets": {
        "description": "Human Connectome Project personality-based cognitive measures",
        "example_trait": {
          "trait_id": "51_10296",
          "measure": "Openness to new experiences (NEO-five factor inventory)",
          "url": "http://rdf.genenetwork.org/v1/id/trait_51_10296"
        }
      }
    }
  },
  
  "research_applications": {
    "multi_level_approach": "Datasets provide molecular expression to behavioral phenotypes",
    "genetic_mapping": "QTL identification for cognitive traits",
    "disease_modeling": "Alzheimer's disease and cognitive aging studies", 
    "comparative_studies": "Cross-species validation with human datasets",
    "aging_research": "Longitudinal cognitive assessments across lifespan"
  },
  
  "key_resources": {
    "main_portal": "https://cd.genenetwork.org/",
    "rdf_endpoint": "http://rdf.genenetwork.org/",
    "primary_reference": "http://rdf.ncbi.nlm.nih.gov/pubmed/22098762"
  }
}

This is the most detailed and satisfactory feedback so far.

Claude Sonnet 4.5

Feedback:

{
  "overview": "BXD datasets provide comprehensive coverage of cognitive phenotypes across multiple behavioral domains, enabling genetic mapping of cognitive traits and identification of quantitative trait loci (QTLs) associated with various aspects of cognition.",
  "cognitive_domains": {
    "working_memory": {
      "description": "Assessment of working memory and age-related cognitive decline in both Alzheimer's disease models and non-transgenic populations",
      "paradigms": [
        "Y-maze spontaneous alternation tests measuring working memory across ages 6-16 months",
        "8-arm radial maze tests measuring spatial working memory in males and females at 11-13 weeks",
        "Age at onset of working memory deficits in AD-BXD (5XFAD transgenic) and non-transgenic BXD populations"
      ],
      "measurements": [
        "Alternating arm returns (error indices)",
        "Arm visit time across multiple days",
        "Performance dropping below 50% chance level (impairment threshold)"
      ],
      "populations": ["AD-BXD transgenic (5XFAD)", "Non-transgenic BXD"],
      "sex_specific_data": true,
      "age_ranges": ["6 months", "8 months", "10 months", "11-13 weeks", "14 months", "16 months"]
    },
    "fear_conditioning_and_contextual_memory": {
      "description": "Measurement of associative learning, fear memory formation, expression, and extinction",
      "paradigms": [
        "Contextual fear memory (CFM) at different ages",
        "Fear conditioning with context and tone measurements",
        "Fear acquisition, expression, and extinction protocols"
      ],
      "measurements": [
        "Freezing behavior in response to context",
        "Freezing behavior in response to tone",
        "Baseline response",
        "Alternative context responses"
      ],
      "age_points": ["6 months", "14 months", "15 months"],
      "populations": ["AD-BXD", "Non-transgenic controls"]
    },
    "attention_and_executive_function": {
      "description": "Assessment of sustained attention, impulsivity, cognitive flexibility, and executive control",
      "paradigms": [
        "5-choice serial reaction time task (5CSRT)",
        "NIH toolbox dimensional change card sort tests",
        "Touchscreen-based visual discrimination assays"
      ],
      "measurements": [
        "Number of completed trials (motivation)",
        "Response accuracy (attention)",
        "Errors of omission (attention/motivation)",
        "Card sort adjusted and unadjusted scores",
        "Sessions to criterion",
        "Reaction time",
        "Extinction learning"
      ],
      "cognitive_constructs": ["Attention", "Motivation", "Impulsivity", "Cognitive flexibility", "Behavioral flexibility"]
    },
    "learning_and_memory": {
      "description": "Measurement of learning acquisition, task performance, and memory consolidation",
      "paradigms": [
        "Touchscreen chamber learning tasks",
        "Visual discrimination learning",
        "Trials to criterion measurements"
      ],
      "measurements": [
        "Learning speed",
        "Initial acquisition of tasks",
        "Trials to criterion",
        "Spontaneous activity in naive adults"
      ]
    },
    "motor_learning_and_coordination": {
      "description": "Assessment of motor skill acquisition and coordination",
      "paradigms": [
        "Rotarod performance under continuous acceleration",
        "Complex wheel tasks"
      ],
      "measurements": [
        "Performance improvements over trials",
        "Differences in performance across developmental stages"
      ],
      "age_ranges": ["90-120 days"],
      "sex_specific_data": true
    },
    "reward_learning_and_decision_making": {
      "description": "Measurement of reward-related learning and decision-making processes",
      "paradigms": [
        "Cocaine-induced conditioned place preference (CPP)",
        "Delay discounting tasks"
      ],
      "measurements": [
        "Baseline measurements in drug and vehicle compartments",
        "Time in drug-paired compartments",
        "Changes in preference",
        "Percent change in conditioning",
        "Area under curve (AUC) at different reward magnitudes (200, 40K)"
      ]
    },
    "social_cognition": {
      "description": "Assessment of social recognition, social preference, and social interaction behaviors",
      "paradigms": [
        "Three-chamber social interaction tasks",
        "Social dominance tasks",
        "Direct social interaction tasks"
      ],
      "measurements": [
        "Social approach preference (chamber and sniffing)",
        "Social novelty preference (chamber and sniffing)",
        "Habituation center latency and time",
        "Chamber entries during habituation, social approach, and social novelty phases",
        "Wins over cagemates and strangers",
        "Sniffing duration",
        "Aggression measures"
      ]
    },
    "cognitive_development": {
      "description": "Measurement of cognitive growth and development in early life",
      "paradigms": [
        "Cognitive growth fostering measures from parent-child interaction assessments"
      ],
      "measurements": [
        "CGF contingency total scores across visits (CV1, CV2, CV3)",
        "CGF subscale total scores across visits",
        "Cognitive risk at 1 year"
      ]
    }
  },
  "anxiety_and_stress_related_measures": {
    "description": "Assessment of anxiety-like behaviors that can affect cognitive performance",
    "paradigms": [
      "Elevated plus maze (EPM)",
      "Elevated zero maze"
    ],
    "measurements": [
      "Time in open arms",
      "Latency to enter open areas",
      "Center duration",
      "Closed arm entries",
      "Open arm entries",
      "Total entries",
      "Activity levels under restraint stress"
    ],
    "experimental_manipulations": ["Restraint stress", "Pharmacological treatments"]
  },
  "experimental_design_features": {
    "sex_differences": "Separate measurements for males and females across multiple paradigms",
    "age_ranges": "Comprehensive coverage from early development (11-13 weeks) through aging (16 months)",
    "longitudinal_tracking": "Age at onset measurements and repeated assessments across time points",
    "disease_models": "Alzheimer's disease models (5XFAD transgenic) compared with non-transgenic controls",
    "environmental_manipulations": ["Normal chow vs. high-fat diet", "Restraint stress", "Pharmacological treatments"],
    "genetic_reference_population": "BXD recombinant inbred strains enabling QTL mapping"
  },
  "data_applications": [
    "Genetic mapping of cognitive traits",
    "Identification of quantitative trait loci (QTLs) associated with cognition",
    "Studies of age-related cognitive decline",
    "Gene-environment interactions affecting cognition",
    "Genetic influences on cognitive function across domains",
    "Modeling of Alzheimer's disease-related cognitive deficits",
    "Sex-specific genetic effects on cognition"
  ],
  "key_references": [
    "http://rdf.ncbi.nlm.nih.gov/pubmed/22098762",
    "http://rdf.ncbi.nlm.nih.gov/pubmed/24586288"
  ],
  "data_access": {
    "platform": "GeneNetwork (genenetwork.org)",
    "trait_viewer": "https://cd.genenetwork.org/show_trait",
    "rdf_endpoints": "http://rdf.genenetwork.org/v1/id/"
  }
}

Feedback from Claude Sonnet 4.5 is very similar to that from Claude Sonnet 4. However, Sonnet 4 emphasized not only the cognitive domains but also the primary dataset; Sonnet 4.5 did not.

I would rate the feedback from Sonnet 4 higher than that from Sonnet 4.5.

Final take

  • On Kimi models

1. Kimi models are viable replacements for Claude. They provide satisfactory, clear, and concise answers.

2. For reliability, prefer kimi-k2-0711-preview.

3. Be careful with kimi-k2.5's instruction following (for the time being).

  • On Claude models

1. Sonnet 4 might be better at handling search tasks than Sonnet 4.5.

2. Claude Haiku is still an option after Sonnet.

3. Claude Haiku does slightly better than kimi-k2.

  • On Claude vs Kimi

The following ranking places the Kimi models in context with Claude's:

1. Claude Sonnet 4 and Sonnet 4.5

2. Claude Haiku

3. Kimi-k2-0711-preview and Kimi-k2-0905-preview

That being said, the comparison was made on only one query. We will keep experimenting to get a better understanding of the performance difference between the models.

4. Setup pi and kimi code

pi is a coding tool that harnesses coding agents for code generation and automation. It offers an interface similar to claude code or kimi code.

We are interested in finding out if it offers some advantages over the use of kimi code or claude code.

Setting up pi... See:

I also set up kimi code to make a comparison. That was pretty easy: See:

To use kimi code or Kimi with pi, one must authenticate using the `/login` command or by setting an API key in the bash environment.

With the `/login` command, I had to already be logged in to my Kimi account from the web interface, which is:

For the API key, I struggled to get started because Kimi has more than one API. The two I know are:

One gives access to raw kimi models (the one I used with AI search):

The other is an API for kimi code specifically:

So I had to subscribe to a plan to generate an API key and get started.
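For the environment-variable route, a typical bash setup might look like the following. The variable name `MOONSHOT_API_KEY` is my assumption based on the Moonshot API convention; the tool you use may expect a different name, so check its documentation:

```shell
# Assumed variable name -- verify against the tool's docs before relying on it
export MOONSHOT_API_KEY="sk-..."
```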

5. Compare code generation capabilities of Kimi and Claude

The interest is to compare performance between the following options:

  • pi + kimi (kimi-k2.5)
  • pi + claude (claude sonnet 4.5)
  • kimi code

I tested performance on a RAG design task with the query:

  • QUERY: I want to write a RAG that takes documents from a text file and embed them into a Chroma database to answer a question using semantic and keyword (BM25) search. Help me generate a ready to use codebase. The code should be complete and well abstracted. The RAG should take the path to documents, path to database if it has already been created and embedding model coming from HuggingFace. Put all the code in a new directory in my home.

I kept the query deliberately a bit vague to see how they would handle complexity and make decisions.
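Since the query asks for combined semantic and keyword search, the keyword side usually means Okapi BM25 scoring. As a reference point for reviewing the generated code, here is a minimal, self-contained BM25 sketch; the tokenization and the parameters `k1` and `b` are standard defaults, not taken from either generated codebase:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs  # average document length
    # Document frequency: in how many documents each term appears
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

docs = [["rag", "retrieval", "generation"], ["embedding", "vector"], ["rag", "pipeline"]]
scores = bm25_scores(["rag"], docs)  # shorter matching docs score higher
```

In a full pipeline these BM25 scores would be combined with the semantic similarities returned by the Chroma query.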

I then reviewed the code generated by each of the options and did a dry run.

I pushed the generated code to a repo. Have a look and form your own opinion at:

pi + kimi (kimi-k2.5)

In about 10 min, I got a full codebase.

Token consumption could not be estimated in dollars because the Kimi console does not report usage that way. A plan grants a specific token quota whose usage is monitored as a percentage. This can be convenient because we do not have to worry about tokens as long as the usage limit is not hit :)

#### What I liked

  • Comprehensive and ready-to-use codebase: kimi-k2.5 generated everything from modules to documentation and a cli tool
  • No syntax errors. The code ran fine. That's impressive!

#### What I did not like

  • The project was not very well structured. Everything was in the home project directory
  • The answer generated by the RAG was not ready for the user. No synthesis was performed. The answer was:
==================================================
ANSWER
==================================================
Based on the retrieved documents, here are the relevant excerpts for: 'What is RAG?'

==================================================

[1] Document: rag_systems | Chunk: 0 | Source: example_documents.txt
Score: 0.0388
----------------------------------------
Retrieval-Augmented Generation (RAG) is a technique that enhances large language models by retrieving relevant information from an external knowledge base before generating a response. RAG combines the strengths of retrieval-based and generation-based approaches. It addresses the limitations of LLMs such as hallucination and outdated knowledge.                       

A typical RAG pipeline consists of three main components: document ingestion, retrieval, and generation. During ingestion, documents are split into chun...

[2] Document: embeddings | Chunk: 0 | Source: example_documents.txt
Score: 0.0265
----------------------------------------
Embeddings are numerical representations of text, images, or other data in a high-dimensional vector space. In the context of natural language processing, word embeddings map words or phrases to vectors of real numbers. These representations capture semantic meaning such that words with similar meanings are located closer together in the vector space.

Sentence embeddings extend this concept to entire sentences or paragraphs. Models like BERT, SBERT, and MPNet generate dense vector representatio...

[3] Document: chroma_db | Chunk: 0 | Source: example_documents.txt
Score: 0.0048
----------------------------------------
Chroma is the open-source embedding database. Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs. It provides a simple API for storing and 
querying embeddings with metadata. Chroma is designed to be simple enough to get started with quickly and powerful enough for production use cases.

Chroma supports various distance metrics including cosine similarity, L2 distance, and inner product. It allows for filtering by metadata and supports both in-memory ...
==================================================

I was expecting a final, well-synthesized answer for the user. Here the RAG just returns relevant documents with an excerpt of each, which is not the end goal.

My instructions did not specify that, but it would have been nice for the coding agent to figure it out :)

  • Something was off with the code computing relevance scores for documents. A document that relevant to RAG should not score as low as 0.0388.
  • The agent wrote all the logic from scratch, which is not efficient. It would have been smarter to discover libraries on the fly and reuse them; that would likely have prevented the error in the relevance score computation.
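One common way to avoid implausibly low hybrid scores is to fuse the two rankings rather than the raw scores, for example with reciprocal rank fusion (RRF). This is a minimal sketch of the general technique, not of the code Kimi generated; the constant `k=60` is the value conventionally used in the RRF literature:

```python
def rrf_fuse(rankings, k=60):
    """Fuse several rankings (lists of doc ids, best first) with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: the semantic and BM25 retrievers disagree on order
semantic = ["d1", "d2", "d3"]
keyword = ["d1", "d4", "d2"]
fused = rrf_fuse([semantic, keyword])  # "d1" wins: top-ranked by both retrievers
```

Because RRF only looks at ranks, it sidesteps the problem of semantic and BM25 scores living on incomparable scales.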

pi + claude (claude sonnet 4.5)

It took about 6 minutes to get the codebase with Claude Sonnet 4.5.

Token consumption was estimated at about $0.40.

Reviewing and testing the codebase...

The codebase was a bit different from the one generated by kimi-k2.5.

#### What I liked

  • Codebase was complete from modules to cli tool and ready to use.
  • Project was better organized. Source code in a dedicated directory.
  • A quickstart guide was added to the README documentation.
  • Coding logic and conveniences were a bit more advanced: import-all behavior was controlled, and the cli tool offered an option to choose between keyword, semantic, and hybrid search.

However, the core logic was the same as the one provided by kimi-k2.5.

  • Relevance scores for the documents were more realistic and satisfactory. See the answer below.

#### What I did not like

  • Claude made an error in the cli tool preventing its execution. I had to fix it manually.
  • Claude also wrote the whole logic from scratch. Still not very efficient, but at least there was no bug in the RAG logic itself.

It would have been great to see Claude discover constructs from LangChain and reuse them to go faster and more reliably.

  • A final, comprehensive synthesis was not generated by Claude either.

Answer was:

Question: What is a neural network?
Search type: hybrid
--------------------------------------------------------------------------------

Retrieved 5 documents:

[Rank 1] Score: 0.7280
Text: cal neural networks that constitute animal brains. Such systems learn to perform tasks by considering examples, generally without being programmed with task-specific rules. A neural network consists o...                                                                                                                                                                
Metadata: {'end_char': 2913, 'source_file': '/export4/johannesm/pi_claude_sonnet4.5/example_data.txt', 'chunk_id': 7, 'start_char': 2451}
--------------------------------------------------------------------------------
[Rank 2] Score: 0.6206
Text: arn from patterns and inference derived from data. Deep Learning Deep learning is a specialized branch of machine learning that uses neural networks with multiple layers (deep neural networks) to prog...                                                                                                                                                                
Metadata: {'start_char': 265, 'chunk_id': 1, 'end_char': 683, 'source_file': '/export4/johannesm/pi_claude_sonnet4.5/example_data.txt'}
--------------------------------------------------------------------------------
[Rank 3] Score: 0.6090
Text: most commonly applied to analyzing visual imagery. They are inspired by biological processes and the connectivity pattern between neurons resembles the organization of the animal visual cortex. Indivi...                                                                                                                                                                
Metadata: {'start_char': 2863, 'chunk_id': 8, 'end_char': 3354, 'source_file': '/export4/johannesm/pi_claude_sonnet4.5/example_data.txt'}
--------------------------------------------------------------------------------
[Rank 4] Score: 0.5877
Text: rithm achieves an acceptable level of performance. Unsupervised Learning Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a dataset with no pre-exis...                                                                                                                                                                
Metadata: {'start_char': 2024, 'end_char': 2501, 'source_file': '/export4/johannesm/pi_claude_sonnet4.5/example_data.txt', 'chunk_id': 6}
--------------------------------------------------------------------------------
[Rank 5] Score: 0.4526
Text: s form a directed graph along a temporal sequence. This allows them to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequ...                                                                                                                                                                
Metadata: {'source_file': '/export4/johannesm/pi_claude_sonnet4.5/example_data.txt', 'start_char': 3304, 'end_char': 3816, 'chunk_id': 9}
--------------------------------------------------------------------------------

Perhaps I was not specific enough in my query, and/or I am expecting the models to be smarter than they are.
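For reference, the synthesis step both agents skipped is small: format the retrieved chunks into a grounding prompt and hand it to an LLM for a final answer. A hedged sketch follows; the prompt wording and the deferred LLM call are my own assumptions, not part of either generated codebase:

```python
def build_synthesis_prompt(question: str, chunks: list[str]) -> str:
    """Assemble retrieved chunks into a grounding prompt for a final LLM answer."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "Cite chunk numbers like [1] where relevant.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# The prompt would then go to any chat model, e.g. via DSPy or an
# OpenAI-compatible client (call omitted here).
prompt = build_synthesis_prompt(
    "What is RAG?",
    ["Retrieval-Augmented Generation (RAG) retrieves documents before generating."],
)
```

With this in place, the CLI could print one synthesized answer instead of raw ranked excerpts.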

kimi code

With kimi code, I got the same code as generated by pi+kimi, so you can refer to that.

This confirmed that kimi code uses kimi-k2.5 and is pretty reproducible.

However, I noticed that with kimi code one cannot choose which Kimi model to use for coding. This is possible with the pi+kimi setup: from pi, I could switch from kimi-k2.5 to kimi-k2-thinking.

Final take

  • kimi-k2.5 is very capable at generating code for GN. It might just need more iterations and reviews than Claude Sonnet 4.5. For example, one can give follow-up prompts to make Kimi produce a better project structure.
  • For complex coding tasks, Claude Sonnet 4.5 will probably be best.
  • It's safer not to expect either model to be as smart as a programmer and to take care of every aspect satisfactorily. In this experiment, both failed to generate a useful final synthesis for the user of the RAG. That means the prompt needs to be very specific and detailed, for example to not leave the coding agent the choice of writing the whole logic from scratch.
  • Using the pi setup instead of kimi code or even claude code is more advantageous. The generations of pi+kimi-k2.5 and kimi code are the same, and with pi one can easily switch between different Kimi models or even providers (e.g. Kimi to Claude). I personally liked the pi interface better; it is more informative and very extensible.
(made with skribilo)