Tuesday, March 6, 2018

Five Years of American Opportunity: U.S. Permanent Visas



Answer the following questions to better understand the storyboard above.
  1. What is the trend in total permanent visa applications over time? In which year did the number of applications reach its maximum, and in which year its minimum? (Slide 1)
  2. Which year has the greatest percentage of certified H-1B visas? (Slide 3)
  3. In 2016, from which countries did applicants apply for positions in Law? (Slide 4)
  4. In what ways are applications for employment in Healthcare clustered differently from those for Industrial employment? What about Management? (Slide 5)

Friday, March 2, 2018

Comparing Network Density of Obese and Lean Samples and Their First OTU Neighbor


Seen above is a network developed from the microbial communities of fecal samples from patients labeled "Obese" and "Lean". Many studies have measured what's called "beta diversity" to show that the microbial communities of obese individuals are dissimilar from those of lean individuals, and I was curious what conclusions one could draw from the perspective of a network.

The blue nodes are samples labeled "Lean", while the red nodes are samples labeled "Obese". The distal pink nodes are "Operational Taxonomic Units" (OTUs), which represent microorganisms and classify groups of closely related individuals. The nodes are sized by degree, the number of connections made to the node. The purple edges are given transparency as a function of their edge weight, and the animation fades between the selection of "Obese" samples with only their first neighbors in the network, and the respective "Lean" view.

The network is configured using an "Edge-weighted Spring-Embedded Layout". From the Cytoscape documentation, the spring-embedded layout is based on a "force-directed" paradigm as implemented by Kamada and Kawai (1988): network nodes are treated like physical objects that repel each other, such as electrons, while the connections between nodes are treated like metal springs attached to the pair of nodes. These springs repel or attract their end points according to a force function, and the layout algorithm sets the positions of the nodes in a way that minimizes the sum of forces in the network.
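To get a feel for how a force-directed layout and degree-based node sizing work, here is a minimal sketch using networkx. It is not part of the original Cytoscape workflow, and the node names and weights are made up for illustration:

import networkx as nx

# Toy sample-OTU network; node names are hypothetical, for illustration only
edges = [("Obese_1", "OTU_A"), ("Obese_1", "OTU_B"), ("Obese_2", "OTU_B"),
         ("Lean_1", "OTU_B"), ("Lean_1", "OTU_C")]
G = nx.Graph()
G.add_weighted_edges_from((u, v, 1.0) for u, v in edges)

# Kamada-Kawai positions minimize a spring-energy function over graph distances,
# the same "force-directed" idea the Cytoscape layout is based on
pos = nx.kamada_kawai_layout(G, weight="weight")

# Size nodes by degree, mirroring the encoding in the visualization above
node_sizes = {node: 100 * G.degree(node) for node in G.nodes}
print(pos["OTU_B"], node_sizes["OTU_B"])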

The seemingly "explosive" appearance of the "Obese" view reflects connections to a more diverse community of microorganisms, and supports the notion that obese samples have a more diverse gut microbiome than lean samples. What's interesting is that the orientation is preserved between views, so shared OTU nodes are easily spotted, while the large increase in nodes for "Obese" samples can be readily observed. These shared OTUs represent the core shared microbiome.

Background


As a part of my Computational Biology course at the University of Washington, I was tasked with creating a statistically-backed visualization of a biological process or simulation. I had previously done an exploratory analysis of Jeffrey Gordon's A Core Gut Microbiome in Obese and Lean Twins, and I was interested in whether there were measurable and visual differences in the networks developed from the microbial communities of lean and obese twins. In that exploratory analysis, I had measured the dissimilarity of the samples' operational taxonomic units using the weighted UniFrac metric, which showed a measurable dissimilarity in the beta diversity of obese samples; pretty interesting to me! (A minimal sketch of that kind of computation follows the list below.) Jeffrey Gordon's study drew three core conclusions, which directed this project:
  1. Wide array of shared genes; there exists a core microbiome at the gene level.
  2. Obesity is associated with phylum-level changes in the microbiota.
  3. Deviations from this core microbiome are associated with physiological states.
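
As a rough illustration of the weighted UniFrac computation mentioned above, here is a minimal sketch using scikit-bio; the count table, sample IDs, and tree below are placeholders, not the study's data:

import numpy as np
from skbio import TreeNode
from skbio.diversity import beta_diversity

# Hypothetical 3-sample x 4-OTU count table; the real input is the rarefied BIOM table
counts = np.array([[10, 0, 5, 1],
                   [2, 8, 3, 0],
                   [0, 7, 1, 9]])
sample_ids = ["S1", "S2", "S3"]
otu_ids = ["OTU1", "OTU2", "OTU3", "OTU4"]
# A made-up phylogenetic tree over the four OTUs
tree = TreeNode.read(["((OTU1:0.5,OTU2:0.5):0.5,(OTU3:0.5,OTU4:0.5):0.5);"])

# Pairwise weighted UniFrac distances between samples (a beta-diversity metric)
dm = beta_diversity("weighted_unifrac", counts, ids=sample_ids,
                    tree=tree, otu_ids=otu_ids)
print(dm)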

Visualization


A link to the specific data can be found here: https://qiita.ucsd.edu/study/description/77
The following preprocessing was done using QIIME command-line scripts:
  • Rarefy the table to an even sampling depth of 1,000 sequences per sample, so that samples are comparable.
single_rarefaction.py
-i 'sample.biom'
-o 'sample_1000.biom'
-d 1000
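For intuition, rarefaction to a depth of 1000 simply subsamples each sample's counts without replacement. Here is a toy sketch of the idea (not the QIIME implementation, and with made-up counts):

import numpy as np

def rarefy_counts(counts, depth=1000, seed=0):
    """Subsample a vector of OTU counts to a fixed depth, without replacement."""
    rng = np.random.default_rng(seed)
    # Expand counts to one entry per observed sequence, then draw 'depth' of them
    pool = np.repeat(np.arange(len(counts)), counts)
    keep = rng.choice(pool, size=depth, replace=False)
    return np.bincount(keep, minlength=len(counts))

# Made-up OTU counts for a single sample, totalling 2000 sequences
print(rarefy_counts(np.array([500, 300, 150, 1050]), depth=1000))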

  • Filter out any samples that you are not analyzing. Here, we do not want 'Overweight' samples; the state string 'obesitycat:*,!Overweight' keeps every obesitycat value except 'Overweight'.
filter_samples_from_otu_table.py
-i 'sample_1000.biom'
-m 'mapping_file.txt'
-o 'sample1000_filtered.biom'
--output_mapping_fp 'mapping_file_filtered.txt'
-s 'obesitycat:*,!Overweight'

  • Make the OTU network with the filtered BIOM and mapping file. Here, we want node and edge properties grouped by "obesitycat".
make_otu_network.py
-i 'sample_1000_filtered.biom'
-m mapping_file_filtered.txt
-o otu_network_filtered
-b "obesitycat"
From here, the visualization was completed in Cytoscape. The increased density of the Obese selection, compared to the Lean selection, suggests that the microbiome of Obese individuals varies much more widely than that of Lean individuals. However, I'm no biologist.
Here are some other images from the visualization:





Statistics & Python Script


The following slides are a simple test case, designed to explain the basic functions of the script.

Running the script on the lean-obese data shows that the degree of OTU nodes associated with "Obese" samples only significantly exceeds that of "Lean"-only nodes. This translates to a greater diversity of the Obese samples' microbial communities, much of which lies outside the core microbiome.
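
Under the hood, the script bootstraps each table: it repeatedly draws 40 rows with replacement and averages the summary statistics across iterations. A stripped-down sketch of that idea, with a hypothetical helper name and toy data:

import pandas as pd

def bootstrap_degree_stats(table, n_iterations=1000, sample_size=40):
    """Average summary statistics of the 'degree' column over bootstrap resamples."""
    rows = []
    for _ in range(n_iterations):
        sample = table.sample(n=sample_size, replace=True)
        rows.append({"mean": sample.degree.mean(),
                     "median": sample.degree.quantile(0.5),
                     "std": sample.degree.std()})
    return pd.DataFrame(rows).mean()

# A toy degree table; the real input is a node table from make_otu_network.py
toy = pd.DataFrame({"degree": [1, 1, 2, 3, 5, 8, 13]})
print(bootstrap_degree_stats(toy, n_iterations=200))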


Category       Min      Q1       Mean     Median   Q3       Max      StdDev
Lean           206.103  248.114  271.906  271.788  296.252  344.526  35.574
Obese          163.965  246.257  275.770  280.659  308.187  363.067  46.730
OTU_LeanOnly   1.000    1.000    1.435    1.010    1.809    4.185    0.759
OTU_ObeseOnly  1.000    1.003    2.732    1.644    3.016    16.614   3.154
OTU_Both       2.067    6.385    24.117   14.049   31.400   119.475  26.959

There is a wider distribution of degree among "Obese" samples than among "Lean" samples, which is representative of a more diverse gut microbiome. The core OTU nodes (OTU_Both) represent the band of shared OTUs.


OTUs associated with Lean-only nodes have a low mean degree compared with those of Obese-only nodes. The higher degree of Obese-only nodes supports the notion that deviations from the core microbiome are associated with physiological states; in this case, obesity.


Help Text


...>python network_analysis.py -h
usage: network_analysis.py [-h] -node NODE_FILE -edge EDGE_FILE -f FEATURE -c
                           CATEGORIES CATEGORIES [-o OUTPUT_FILE]
                           [-n N_ITERATIONS] [-v] [--version]

        network_analysis.py; analyze statistics of degree comparing between two categories of feature column.
        Example:         network_analysis.py

            -node {PATH to NODE FILE}

            -edge {PATH to EDGE FILE}

            [-o {PATH to OUTPUT DIRECTORY}]

            -f {FEATURE COLUMN for comparison}

            -c {CATEGORY of FEATURE} {CATEGORY of FEATURE}

            [-n {N_ITERATIONS for Monte Carlo Simulation}]


optional arguments:
  -h, --help            show this help message and exit
  -node NODE_FILE, --node_file NODE_FILE
                        path to an input node file, output from
                        make_otu_network.py
  -edge EDGE_FILE, --edge_file EDGE_FILE
                        path to an input edge file, output from
                        make_otu_network.py
  -f FEATURE, --feature FEATURE
                        Name of the feature column for analysis
  -c CATEGORIES CATEGORIES, --categories CATEGORIES CATEGORIES
                        Name of categories within the feature column for
                        analysis, two (2) required.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        PATH to the output DIRECTORY. Default:
                        ./[feature]_network_analysis.txt
  -n N_ITERATIONS, --n_iterations N_ITERATIONS
                        Number of iterations for the analysis, will take
                        samples for n iterations. Default:1000
  -v, --verbose         display verbose output while program runs.
                        Default:True
  --version             display version number and exit

        This script will analyze statistics between two categories of a feature column in a node table.
        Returns output text file with statistics for the degree of each category, and otus associated with the
        respective categories, as well as both. Accuracy of the statistics can be controlled with n_iterations.

        Rationale
        ---------

        Comparing the degree of the different categories of a feature column can display a disparity of otu frequency
        in one category, or the other. This translates to a statistically significant difference between the microbial
        communities with respect to the categories analyzed.

        References
        ----------
        Qiime: http://qiime.org/
        Qiita: https://qiita.ucsd.edu/
        Gut Microbiome Dataset: https://qiita.ucsd.edu/study/description/77
        Biom-Format: http://biom-format.org/documentation/biom_format.html
        Cytoscape: http://www.cytoscape.org/documentation_users.html
        Make_otu_network.py: http://qiime.org/scripts/make_otu_network.html

        Notes
        ----------
        Given a BIOM and Mapping File, the following example can be used to generate the necessary node and edge files.
        Requires QIIME.

        Rarefy the table for increased accuracy
            single_rarefaction.py -i 'sample.biom' -o 'sample_1000.biom' -d 1000

        Filter any samples that you are not analyzing. Here, we do not want 'Overweight' samples.
            filter_samples_from_otu_table.py
                -i 'sample_1000.biom' -m 'mapping_file.txt' -o 'sample1000_filtered.biom'
                --output_mapping_fp 'mapping_file_filtered.txt' -s 'obesitycat:*,!Overweight'

        Make the otu network with the filtered biom and mapping file, here we wanted properties based on "obesitycat"
            make_otu_network.py
                -i 'sample_1000_filtered.biom' -m mapping_file_filtered.txt -o otu_network_filtered  -b "obesitycat"


network_analysis.py


#!/usr/bin/env python
from __future__ import division
import os
import sys

import pandas as pd

__author__ = "Samuel L. Peoples"
__credits__ = ["Dr. Jesse Zaneveld"]
__version__ = "0.0.1"
__email__ = "contact@lukepeoples.com"
__status__ = "Development"

from argparse import ArgumentParser, RawDescriptionHelpFormatter, FileType
# Documentation can be found here: https://docs.python.org/2/library/argparse.html#module-argparse

def make_commandline_interface():
    """Returns a parser for the commandline"""
    short_description = \
        """
        network_analysis.py; analyze statistics of degree comparing between two categories of feature column.
        Example: \t network_analysis.py  \n\t\t
            -node {PATH to NODE FILE} \n\t\t
            -edge {PATH to EDGE FILE} \n\t\t
            [-o {PATH to OUTPUT DIRECTORY}] \n\t\t
            -f {FEATURE COLUMN for comparison} \n\t\t
            -c {CATEGORY of FEATURE} {CATEGORY of FEATURE}\n\t\t
            [-n {N_ITERATIONS for Monte Carlo Simulation}]
        """

    long_description = \
        """
        This script will analyze statistics between two categories of a feature column in a node table.
        Returns output text file with statistics for the degree of each category, and otus associated with the 
        respective categories, as well as both. Accuracy of the statistics can be controlled with n_iterations.
    
        Rationale
        ---------
    
        Comparing the degree of the different categories of a feature column can display a disparity of otu frequency 
        in one category, or the other. This translates to a statistically significant difference between the microbial 
        communities with respect to the categories analyzed.
    
        References
        ----------
        Qiime: http://qiime.org/
        Qiita: https://qiita.ucsd.edu/
        Gut Microbiome Dataset: https://qiita.ucsd.edu/study/description/77
        Biom-Format: http://biom-format.org/documentation/biom_format.html
        Cytoscape: http://www.cytoscape.org/documentation_users.html
        Make_otu_network.py: http://qiime.org/scripts/make_otu_network.html 
        
        Notes
        ----------
        Given a BIOM and Mapping File, the following example can be used to generate the necessary node and edge files.
        Requires QIIME.
        
        Rarefy the table for increased accuracy
            single_rarefaction.py -i 'sample.biom' -o 'sample_1000.biom' -d 1000
        
        Filter any samples that you are not analyzing. Here, we do not want 'Overweight' samples.
            filter_samples_from_otu_table.py 
                -i 'sample_1000.biom' -m 'mapping_file.txt' -o 'sample1000_filtered.biom' 
                --output_mapping_fp 'mapping_file_filtered.txt' -s 'obesitycat:*,!Overweight' 
        
        Make the otu network with the filtered biom and mapping file, here we wanted properties based on "obesitycat"
            make_otu_network.py 
                -i 'sample_1000_filtered.biom' -m mapping_file_filtered.txt -o otu_network_filtered  -b "obesitycat"
        """

    parser = ArgumentParser(description=short_description, \
                            epilog=long_description, formatter_class=RawDescriptionHelpFormatter)

    # Required parameters
    parser.add_argument('-node', '--node_file', type=str, required=True, \
                        help='PATH to an input NODE FILE, output from make_otu_network.py')

    parser.add_argument('-edge', '--edge_file', type=str, required=True, \
                        help='PATH to an input EDGE FILE, output from make_otu_network.py')

    parser.add_argument('-f', '--feature', type=str, required=True, \
                        help='Name of the FEATURE column for analysis')

    parser.add_argument('-c', '--categories', type=str, nargs=2, required=True, \
                        help='Name of CATEGORIES within the feature column for analysis, two (2) required.')
    # Optional parameters
    parser.add_argument('-o', '--output_file', type=str, default='.',
                        help='PATH to the output DIRECTORY. Default: ./[feature]_network_analysis.txt')

    parser.add_argument('-n', '--n_iterations', type=int, default=1000, \
                        help="Number of iterations for the analysis, will take samples for n iterations. Default:%(default)s")

    # Flag option; note that with default=True, verbose output is always enabled
    parser.add_argument('-v', '--verbose', default=True, action='store_true', \
                        help="display verbose output while program runs. Default:%(default)s")

    # Add version information (from the __version__ string defined at top of script
    parser.add_argument('--version', action='version', version=__version__, \
                        help="display version number and exit")

    return parser


def parse_node_table(node_file, feature, categories, verbose):
    """
    Parses the node table's user_nodes degree and feature,
    returns separated DataFrames based on feature categories.
    :param node_file: filepath to node file
    :param feature: feature column for analysis
    :param categories: categories of feature column
    :param verbose: verbosity
    :return: DataFrame for each category containing node_disp_name, degree, and feature
    """
    if verbose:
        print("Parsing "+str(node_file))

    # Read the node file
    df = pd.read_csv(node_file, sep="\t")
    # Save just user nodes
    df = df[df.ntype == "user_node"]
    # Reduce the node file DataFrame
    df = df[["node_disp_name", "degree", feature]]
    # Separate the DataFrame into the two defined categories
    cat_0_table = df[df[feature] == categories[0]]
    cat_1_table = df[df[feature] == categories[1]]
    # Return the tables
    return cat_0_table, cat_1_table


def parse_otu_node_table(node_file, edge_file, feature, verbose):
    """
    Parses the otu nodes by joining the data in the edge file with the node file.
    :param node_file: filepath to node file
    :param edge_file: filepath to edge file
    :param feature: feature column for analysis
    :param verbose: verbosity
    :return: joined DataFrame containing from, to, feature, and degree columns
    """
    if verbose:
        print("Parsing "+str(edge_file))
    # Read the node file and drop rows with no node name
    node_column_list = ["node_name", "degree", feature]
    df_node = pd.read_csv(node_file, sep="\t")
    df_node = df_node.dropna(subset=["node_name"])
    df_node = df_node[node_column_list]

    # Read the edge file
    edge_column_list = ["from", "to", feature]
    df_edge = pd.read_csv(edge_file, sep="\t")
    df_edge = df_edge[edge_column_list]

    # Wrangling to join the degree, from, to, and feature columns on the OTU id ('to');
    # coerce non-numeric ids to NaN so both 'to' columns share a numeric dtype for the merge
    df_edge = df_edge.sort_values(by=['to'])
    df_edge['to'] = pd.to_numeric(df_edge['to'], errors='coerce')
    df_node.rename(columns={'node_name': 'to'}, inplace=True)
    df_node = df_node.sort_values(by=['to'])
    df_node['to'] = pd.to_numeric(df_node['to'], errors='coerce')

    # Join the tables; keep the edge table's feature column
    df_union = df_edge.merge(df_node, how='inner', on='to')
    df_union = df_union.drop([feature + "_y"], axis=1)
    df_union.rename(columns={feature + '_x': feature}, inplace=True)

    if verbose:
        print("\nUnioned DataFrame: ")
        print(df_union.head(n=10))
        print("\t ...")
    return df_union


def split_categories(df_union, categories, feature, verbose):
    """
    Splits the unioned DataFrame into three DataFrames with unique OTU nodes: cat_0 only, cat_1 only, and both
    :param df_union: joined DataFrame
    :param categories: categories of the feature column
    :param feature: feature column for testing
    :param verbose: verbosity
    :return: otu_0_table, otu_1_table, otu_both_table
    """
    # Collect the OTU identifiers ('to') connected to each category
    cat_0_set = set()
    cat_1_set = set()
    for _, row in df_union.iterrows():
        if row[feature] == categories[0]:
            cat_0_set.add(row['to'])
        elif row[feature] == categories[1]:
            cat_1_set.add(row['to'])

    # Split into OTUs seen in both categories, or in one category only
    both_set = cat_0_set & cat_1_set
    cat_0_only = cat_0_set - both_set
    cat_1_only = cat_1_set - both_set

    # Lists for the first category's DataFrame
    from_0 = []
    to_0 = []
    deg_0 = []
    feat_0 = []

    # Lists for the second category's DataFrame
    from_1 = []
    to_1 = []
    deg_1 = []
    feat_1 = []

    # Lists for the otus which appear in both categories
    from_b = []
    to_b = []
    deg_b = []
    feat_b = []

    u_to_list = []
    u_0_list = []
    u_1_list = []

    # Populate the separated lists, reducing each table to distinct OTU nodes
    for _, row in df_union.iterrows():
        otu = row['to']
        if otu in both_set:
            if otu not in u_to_list:
                u_to_list.append(otu)
                from_b.append(row['from'])
                to_b.append(otu)
                deg_b.append(row['degree'])
                feat_b.append(row[feature])
        elif otu in cat_0_only:
            if otu not in u_0_list:
                u_0_list.append(otu)
                from_0.append(row['from'])
                to_0.append(otu)
                deg_0.append(row['degree'])
                feat_0.append(row[feature])
        elif otu in cat_1_only:
            if otu not in u_1_list:
                u_1_list.append(otu)
                from_1.append(row['from'])
                to_1.append(otu)
                deg_1.append(row['degree'])
                feat_1.append(row[feature])

    # Create the DataFrame for each category, and for the otus which appear in both
    otu_0_table = pd.DataFrame(data={"from": from_0, "to": to_0, "degree": deg_0, feature: feat_0})
    otu_1_table = pd.DataFrame(data={"from": from_1, "to": to_1, "degree": deg_1, feature: feat_1})
    otu_both_table = pd.DataFrame(data={"from": from_b, "to": to_b, "degree": deg_b, feature: feat_b})

    if verbose:
        print(categories[0] + " Only:")
        print(otu_0_table.head(n=10))
        print("\t\t ...")
        print(categories[1] + " Only:")
        print(otu_1_table.head(n=10))
        print("\t\t\t ...")
        print("Both " + categories[0] + " and " + categories[1] + ":")
        print(otu_both_table.head(n=10))
        print("\t\t\t\t ...")
    return otu_0_table, otu_1_table, otu_both_table


def parse_stats(feature, categories, cat_0_table, cat_1_table, otu_0_table, otu_1_table, otu_both_table, n_iterations, output_file, verbose):
    """
    Parse the statistics for each DataFrame by averaging n_iterations of random samples. Finds Min, Q1, Mean,
    Median, Q3, Max, and Standard Deviation over the iterations.
    :param feature: feature column for analysis
    :param categories: categories of the feature column
    :param cat_0_table: user_node degree DataFrame for the first category
    :param cat_1_table: user_node degree DataFrame for the second category
    :param otu_0_table: otu_node degree DataFrame which is associated with the first category only.
    :param otu_1_table: otu_node degree DataFrame which is associated with the second category only.
    :param otu_both_table: otu_node degree DataFrame which is associated with both categories.
    :param n_iterations: number of iterations for the analysis
    :param output_file: output file location, appends with the feature category and network_analysis.txt
        ex: C:/.../data/output/feature_network_analysis.txt
    :param verbose: verbosity
    """
    v_string = "Processing statistics for "+categories[0]+" nodes, for "+str(n_iterations)+" iterations, with samples of 40."
    # Parse the stats for the first category
    stats_0 = individual_stats(cat_0_table, n_iterations, verbose, v_string)

    v_string = "Processing statistics for "+categories[1]+" nodes, for "+str(n_iterations)+" iterations, with samples of 40."
    # Parse the stats for the second category
    stats_1 = individual_stats(cat_1_table, n_iterations, verbose, v_string)

    v_string = "Processing statistics for otu nodes connected to " + categories[0] + " only, for " + str(
        n_iterations) + " iterations, with samples of 40."
    # Parse the stats for the otus associated with the first category only
    stats_otu_0 = individual_stats(otu_0_table, n_iterations, verbose, v_string)

    v_string = "Processing statistics for otu nodes connected to " + categories[1] + " only, for " + str(
            n_iterations) + " iterations, with samples of 40."
    # Parse the stats for the otus associated with the second category only
    stats_otu_1 = individual_stats(otu_1_table, n_iterations, verbose, v_string)

    v_string = "Processing statistics for otu nodes connected to both " + categories[0] + " and " + categories[
            1] + ", for " + str(n_iterations) + " iterations, with samples of 40."
    # Parse the stats for the otus associated with both categories
    stats_otu_b = individual_stats(otu_both_table, n_iterations, verbose, v_string)

    # Save the stats to the output file location
    out_path = os.path.join(output_file, str(feature) + "_network_analysis.txt")
    with open(out_path, 'w') as outfile:
        outfile.write(categories[0] + ":\n" + stats_0 + "\n")
        outfile.write(categories[1] + ":\n" + stats_1 + "\n")
        outfile.write(categories[0] + "Only :\n" + stats_otu_0 + "\n")
        outfile.write(categories[1] + "Only :\n" + stats_otu_1 + "\n")
        outfile.write("Both " + categories[0] + " and " + categories[1] + ":\n" + stats_otu_b + "\n")

    #Print the stats
    if verbose:
        print("Statistics:")
        print(categories[0] + ":\n" + stats_0)
        print(categories[1] + ":\n" + stats_1)
        print(categories[0] + "Only :\n" + stats_otu_0)
        print(categories[1] + "Only :\n" + stats_otu_1)
        print("Both "+ categories[0] + " and " + categories[1] + ":\n" + stats_otu_b)
        print("Output saved to: "+output_file+"/"+str(feature)+"_network_analysis.txt")


def individual_stats(table, n_iterations, verbose, v_string):
    """
    Individual stats for each table passed in, bootstrapped over n_iterations samples of 40 rows
    :param table: DataFrame with column labeled 'degree'
    :param n_iterations: number of iterations for the test
    :param verbose: verbosity
    :param v_string: message to print when verbose
    :return: string containing statistics; Min, Q1, Mean, Median, Q3, Max, Std_Dev
    """
    if verbose:
        print(v_string)

    # Define lists for the stats
    minimum = []
    q1 = []
    mean_val = []
    median_val = []
    q3 = []
    maximum = []
    std_dev = []

    # Save the original table for sampling
    orig = table
    for i in range(n_iterations):
        # Take a sample
        table = orig.sample(n=40, replace=True)
        # Append the lists
        minimum.append(table.degree.min())
        q1.append(table.degree.quantile(.25))
        mean_val.append(table.degree.mean())
        median_val.append(table.degree.quantile(.5))
        q3.append(table.degree.quantile(.75))
        maximum.append(table.degree.max())
        std_dev.append(table.degree.std())

    # Create a DataFrame of Stats
    d = {'minimum': minimum, 'q1': q1, 'mean_val': mean_val, 'median_val': median_val, 'q3': q3, 'maximum': maximum,
           'std_dev': std_dev}
    df = pd.DataFrame(data=d)

    # Build the stats string to return
    stats = ("\t Min: " + str(round(df.minimum.mean(), 3))
          + "\t 1Q: " + str(round(df.q1.mean(), 3))
          + "\t Mean: " + str(round(df.mean_val.mean(), 3))
          + "\t Median: " + str(round(df.median_val.mean(), 3))
          + "\t 3Q: " + str(round(df.q3.mean(), 3))
          + "\t Max: " + str(round(df.maximum.mean(), 3))
          + "\t Std: " + str(round(df.std_dev.mean(), 3)))
    return stats


def main():
    """Main function"""
    parser = make_commandline_interface()
    args = parser.parse_args()

    node_file = args.node_file
    if not os.path.isfile(node_file):
        print(node_file + " not found. Please verify location.")
        sys.exit(1)
    edge_file = args.edge_file
    if not os.path.isfile(edge_file):
        print(edge_file + " not found. Please verify location.")
        sys.exit(1)
    output_file = args.output_file
    if not os.path.isdir(output_file):
        print(output_file + " not found. Please verify location.")
        sys.exit(1)

    feature = args.feature
    categories = args.categories
    n_iterations = args.n_iterations
    verbose = args.verbose

    if verbose:
        print("network_analysis.py")
        print("\t Node file:", node_file)
        print("\t Edge file:", edge_file)
        print("\t Output filepath:", output_file)
        print("\t Feature: ", feature)
        print("\t Categories: ", categories)
        print("\t n_iterations: ", n_iterations)

    cat_0_table, cat_1_table = parse_node_table(node_file, feature, categories, verbose)
    df_union = parse_otu_node_table(node_file, edge_file, feature, verbose)
    otu_0_table, otu_1_table, otu_both_table = split_categories(df_union, categories, feature, verbose)

    parse_stats(feature, categories, cat_0_table, cat_1_table, otu_0_table,
                otu_1_table, otu_both_table, n_iterations, output_file, verbose)


if __name__ == "__main__":
    main()


Console Output


...>python network_analysis.py -node "~./data/filtered_data/otu_network_filtered/real_node_table.txt" -edge "~./data/filtered_data/otu_network_filtered/real_edge_table.txt" -o "~./data/results" -f "obesitycat" -c "Lean" "Obese"
network_analysis.py
         Node file: ~./data/filtered_data/otu_network_filtered/real_node_table.txt
         Edge file: ~./data/filtered_data/otu_network_filtered/real_edge_table.txt
         Output filepath: ~./data/results
         Feature:  obesitycat
         Categories:  ['Lean', 'Obese']
         n_iterations:  1000
Parsing ~./data/filtered_data/otu_network_filtered/real_node_table.txt
Parsing ~./data/filtered_data/otu_network_filtered/real_edge_table.txt

Unioned DataFrame:
         from     to obesitycat  degree
0    77.TS134  12727      Obese       2
1  77.TS126.2  12727      Obese       2
2     77.TS19  13986      Obese       3
3    77.TS127  13986       Lean       3
4     77.TS66  13986      Obese       3
5    77.TS2.2  15728       Lean      43
6  77.TS134.2  15728      Obese      43
7   77.TS27.2  15728      Obese      43
8   77.TS39.2  15728      Obese      43
9    77.TS124  15728       Lean      43
         ...
Lean Only:
  obesitycat        from  degree      to
0       Lean  77.TS185.2       1   16477
1       Lean    77.TS4.2       1   24162
2       Lean  77.TS155.2       1   32546
3       Lean  77.TS165.2       1   34789
4       Lean     77.TS13       1   70632
5       Lean    77.TS129       1  109587
6       Lean  77.TS109.2       1  110059
7       Lean     77.TS25       2  113278
8       Lean      77.TS2       1  113827
9       Lean   77.TS30.2       1  113919
                 ...
Obese Only:
  obesitycat        from  degree     to
0      Obese    77.TS134       2  12727
1      Obese  77.TS119.2       2  24546
2      Obese  77.TS118.2       3  25534
3      Obese    77.TS190       1  28218
4      Obese    77.TS156       1  29566
5      Obese  77.TS169.2       1  33112
6      Obese     77.TS21       1  34139
7      Obese     77.TS87       1  35260
8      Obese     77.TS43      10  36330
9      Obese    77.TS169       3  36378
                         ...
Both Lean and Obese:
  obesitycat        from  degree     to
0      Obese     77.TS19       3  13986
1       Lean    77.TS2.2      43  15728
2      Obese  77.TS116.2      39  16054
3       Lean    77.TS195       2  16340
4      Obese   77.TS94.2       4  17311
5      Obese   77.TS70.2      12  19611
6      Obese     77.TS74       5  31249
7      Obese    77.TS133      19  48084
8       Lean   77.TS13.2       4  49088
9      Obese   77.TS67.2      14  52624
                                 ...
Processing statistics for Lean nodes, for 1000 iterations, with samples of 40.
Processing statistics for Obese nodes, for 1000 iterations, with samples of 40.
Processing statistics for otu nodes connected to Lean only, for 1000 iterations, with samples of 40.
Processing statistics for otu nodes connected to Obese only, for 1000 iterations, with samples of 40.
Processing statistics for otu nodes connected to both Lean and Obese, for 1000 iterations, with samples of 40.
Statistics:
Lean:
         Min: 206.103    1Q: 248.114     Mean: 271.906   Median: 271.788         3Q: 296.252     Max: 344.526    Std: 35.574
Obese:
         Min: 163.965    1Q: 246.257     Mean: 275.77    Median: 280.659         3Q: 308.187     Max: 363.067    Std: 46.73
LeanOnly :
         Min: 1.0        1Q: 1.0         Mean: 1.435     Median: 1.01    3Q: 1.809       Max: 4.185  Std: 0.759
ObeseOnly :
         Min: 1.0        1Q: 1.003       Mean: 2.732     Median: 1.644   3Q: 3.016       Max: 16.614     Std: 3.154
Both Lean and Obese:
         Min: 2.067      1Q: 6.385       Mean: 24.117    Median: 14.049  3Q: 31.4        Max: 119.475    Std: 26.959
Output saved to: ~./data/results/obesitycat_network_analysis.txt