Cell killing plot pipeline #79
MikeLippincott
left a comment
LGTM. I have some efficiency suggestions and a few concerns about large notebooks and separation of concerns, but overall this looks good!
| # Calculate the Euclidean distance for each row from the mean values
| distances = np.linalg.norm(data - mean_values, axis=1)
|
| # Create a new DataFrame to store distances with SampleID
| new_rnaseq_data['Euclidean_Distance'] = distances
Consider merging these two steps into one assignment to avoid creating an intermediate variable.
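A minimal sketch of the merged form, assuming `data` is the numeric matrix behind `new_rnaseq_data` and `mean_values` is a per-column reference vector. The toy values, column names, and sample IDs below are made up for illustration and are not taken from the notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real RNA-seq table; values and index are hypothetical.
new_rnaseq_data = pd.DataFrame(
    {"gene_a": [0.0, 3.0], "gene_b": [0.0, 4.0]},
    index=pd.Index(["S1", "S2"], name="SampleID"),
)
data = new_rnaseq_data.to_numpy()
# Assumed reference point; the notebook computes its own mean_values.
mean_values = np.zeros(data.shape[1])

# Compute the distances and store them in one step, with no intermediate variable.
new_rnaseq_data["Euclidean_Distance"] = np.linalg.norm(data - mean_values, axis=1)
```

The behavior is the same as the two-step version; the only difference is that `distances` is never bound to a separate name.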
| # Print the SampleID and corresponding Euclidean Distance for each row
| for idx, row in new_rnaseq_data.iterrows():
|     print(f"SampleID: {idx}, Euclidean Distance: {row['Euclidean_Distance']}")
Consider writing this to a log, or printing only a few rows, to avoid cluttering stdout.
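One possible shape for this, sketched with the standard `logging` module and `Series.head`. The logger name, the toy distances, and the choice of three preview rows are assumptions for illustration only:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("euclidean_distance")  # hypothetical logger name

# Toy stand-in for the real table; values and sample IDs are hypothetical.
new_rnaseq_data = pd.DataFrame(
    {"Euclidean_Distance": [0.5, 1.2, 3.4, 0.9, 2.1, 4.8]},
    index=pd.Index([f"S{i}" for i in range(1, 7)], name="SampleID"),
)

# Print only the first few rows to keep the notebook output short.
preview = new_rnaseq_data["Euclidean_Distance"].head(3)
print(preview.to_string())

# Send the full listing to the log at DEBUG level, so it stays hidden
# unless the log level is lowered.
for sample_id, distance in new_rnaseq_data["Euclidean_Distance"].items():
    logger.debug("SampleID: %s, Euclidean Distance: %s", sample_id, distance)
```

With the level set to `INFO`, the per-row messages are suppressed; switching to `logging.DEBUG` brings them back without touching the loop.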
| latent_df = pd.DataFrame(latent_predictions, columns=["latent_score"])
|
| print(latent_predictions)
Same here for the printing!
| collab_preds_dir = pathlib.Path("../7.collab-data/results").resolve()
| collab_preds_dir.mkdir(parents=True, exist_ok=True)
|
| latent_pred_file = collab_preds_dir / "phgg_latent_predictions.parquet"
Consider moving this to the top of the notebook.
| # In[9]:
|
| # Define the location of the saved models and output directory for results
| overall_counts["percent"] = overall_counts["count"] / total_modelids * 100
|
| # 2. Subset for brain tumors: Neuroblastoma and Diffuse Glioma
| brain_df = total_drugs[total_drugs["OncotreePrimaryDisease"].isin(["Neuroblastoma", "Diffuse Glioma"])]
Are these the only brain tumors, or just the ones you are interested in?
| df = compute_and_plot_latent_scores(sample, latent_df, drug_max, "name", "pearson_correlation", "Drug")
| drug_merge_df.append(df)
Consider combining these into one line: append the result directly instead of first storing the DataFrame in a variable and then adding it to the list. Writing it once helps avoid memory leaks.
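A small sketch of the one-line form. `compute_scores` and the sample IDs are hypothetical stand-ins, not the PR's `compute_and_plot_latent_scores` or its real inputs:

```python
import pandas as pd


def compute_scores(sample: str) -> pd.DataFrame:
    """Hypothetical stand-in for compute_and_plot_latent_scores."""
    return pd.DataFrame({"sample": [sample], "latent_score": [len(sample)]})


samples = ["sample_a", "sample_bb"]  # hypothetical sample IDs

# Append the result directly instead of binding it to a temporary name first.
drug_merge_df = []
for sample in samples:
    drug_merge_df.append(compute_scores(sample))

# Equivalent list comprehension, if the loop body does nothing else.
drug_merge_df = [compute_scores(sample) for sample in samples]
```

In CPython the temporary name holds only a reference, not a copy, so the main gains are likely readability and not keeping the last DataFrame bound to a stray variable after the loop ends.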
| p_df = compute_and_plot_latent_scores(sample, latent_df, reactome_max, "reactome_pathway", "nes_score", "Reactome")
| c_df = compute_and_plot_latent_scores(sample, latent_df, corum_max, "reactome_pathway", "nes_score", "CORUM")
| pathway_merge_df.append(p_df)
| corum_merge_df.append(c_df)
See the memory leak comment above; the same applies here.
| # In[5]:
|
| cell_killing_df <- auc_df
Consider avoiding these rename calls by giving the data frame its final name when it is created.
This PR processes our collaborator data and provides latent scores for cell killing comparisons.