Examining Exoplanets for Habitability

Fall 2019: Terrence Parks, Jeffrey Robbins, Gavin Hanley

space image from nasa.gov

Getting Started

Exoplanets are defined as any planet that exists outside of our solar system. The first exoplanet was discovered in 1995, with thousands of confirmed exoplanets to follow. With an estimated average of at least one exoplanet orbiting every star in the galaxy, this means there are trillions of exoplanets in our galaxy alone. Such a large quantity of exoplanets suggests that it is highly unlikely for earth to be the only planet able to support forms of life.

The objective of this project is to analyze exoplanet data obtained from NASA to determine exoplanets that have potential for habitability.

This project makes use of Python 3 as well as several imported libraries:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn as sk
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn import preprocessing
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from scipy.stats import pearsonr

Data Collection

The first step of this project is to collect the exoplanet data to be used for analysis. The data is aquired from a tab-delimited text file found on the NASA exoplanet archive.

Initially, the data is read using the read_csv function defined in the pandas library. This function parses the text file, using tabs within the text file to separate data entries. There are 70 different columns in the text file, many of which are not of any use to our analysis. To only add the columns that are useful for our analysis, we use the usecols parameter within the read_csv function, which takes an array of existing columns in the dataset and only adds those columns to the dataframe.

In [2]:
# create pandas dataframe using read_csv to parse text file using tabs as separator
# usecols is used to select only the desired parameters relevant to our analysis
exoplanet_df = pd.read_csv("planet_data.txt", sep = '\t', comment = '#', usecols = ['pl_name', 
               'pl_orbper', 'pl_orbsmax', 'pl_orbeccen', 'pl_orbincl', 'pl_bmassj', 'pl_radj', 'pl_dens', 'st_dist',
               'st_optmag', 'st_teff', 'st_mass', 'st_rad', 'rowupdate', 'pl_facility'])

Data Processing

After adding the desired columns to the dataframe, we will rename the columns to be more readable and meaningful. This will help when trying to remember what each column represents. This can be done by acessing the columns attribute of the dataframe and simply setting that equal to an array of strings. These strings represent the new names of the columns. It is important to match the new column names in the correct order of the existing dataframe, otherwise we will have a misrepresentation of the data.

In [3]:
# create new column names for existing columns to be more readable
exoplanet_df.columns = ['planet_name', 'orbital_period', 'orbit_semi_major_axis',
                        'eccentricity', 'inclination', 'planet_mass', 'planet_radius', 'planet_density', 
                       'distance', 'optical_magnitude', 'effective_temperature', 'stellar_mass', 'stellar_radius',
                       'date_of_last_update', 'discovery_facility']

Next, we want to make our dataframe ready for analysis. Part of our analysis will involve comparing each exoplanet's mass and radius with Earth's mass and radius in order to determine how similar the respective exoplanet is to Earth. However, the NASA data gives each exoplanet's mass in relation to Jupiter's mass. We will convert each exoplanet's mass from being relative to Juipter's mass to being relative to Earth's mass. To do this, we simply multiply the mass and radius of Earth relative to Jupiter.

In [4]:
# convert planet mass and radius from being relative to Jupiter to  being relative to Earth by 
# multiplying by the respective conversion rate
for index, row in exoplanet_df.iterrows():
    exoplanet_df.loc[index, 'planet_mass'] = row['planet_mass'] * 317.83
    exoplanet_df.loc[index, 'planet_radius'] = row['planet_radius'] * 11.2089
    
exoplanet_df.head()
Out[4]:
planet_name orbital_period orbit_semi_major_axis eccentricity inclination planet_mass planet_radius planet_density distance optical_magnitude effective_temperature stellar_mass stellar_radius date_of_last_update discovery_facility
0 11 Com b 326.03000 1.29 0.231 NaN 6165.9020 NaN NaN 93.37 4.740 4742.0 2.70 19.00 2014-05-14 Xinglong Station
1 11 UMi b 516.21997 1.53 0.080 NaN 4684.8142 NaN NaN 125.72 5.016 4213.0 2.78 29.79 2018-09-06 Thueringer Landessternwarte Tautenburg
2 14 And b 185.84000 0.83 0.000 NaN 1525.5840 NaN NaN 75.59 5.227 4813.0 2.20 11.00 2014-05-14 Okayama Astrophysical Observatory
3 14 Her b 1773.40002 2.93 0.370 NaN 1481.0878 NaN NaN 17.94 6.610 5338.0 0.90 0.93 2018-09-06 W. M. Keck Observatory
4 16 Cyg B b 798.50000 1.66 0.680 NaN 565.7374 NaN NaN 21.15 6.250 5750.0 1.08 1.13 2018-09-06 Multiple Observatories

Exploratory Analysis

After collecting and tidying the exoplanet dataset, we will begin to explore relationships that exist between different characteristics of each exoplanet.

To start, we want to get a general sense of the size of each exoplanet in relation to earth. The characteristics we will use to make this observation are the exoplanet's radius, mass, and density.

The plot below shows the relationship between exoplanet radius and exoplanet mass. The units for radius are in relation to Earth's radius, and the units for mass are in relation to Earth's mass.

In [5]:
plt.figure(figsize = (14,8)) # configure size of plot
plt.xlim(0, 30)  # set x and y limits
plt.ylim(0,1500)
# use seaborn regplot to produce a scatter plot with line of best fit
sns.regplot(x = exoplanet_df['planet_radius'], y = exoplanet_df['planet_mass'])
plt.xlabel('Exoplanet Radius (Number of Earth Radii)') # set labels for axes
plt.ylabel('Exoplanet Mass (Number of Earth Masses)')
plt.grid() # show grid
plt.title('Exoplanet Mass vs Exoplanet Radius') # create plot title
plt.show()

As we would assume, the mass of an exoplanet increases as the radius increases.

The next plot will look at the relationship between exoplanet radius and exoplanet density. The units for radius are in relation to Earth's radius, and the units for density are g/cm3

In [6]:
plt.figure(figsize = (14,8)) # set plot size
plt.ylim(0,8) # set y limit
# use seaborn regplot to create a scatter plot with line of best fit
sns.regplot(x = exoplanet_df['planet_radius'], y = exoplanet_df['planet_density'])
plt.title('Exoplanet Density vs Exoplanet Radius') # plot title
plt.xlabel('Exoplanet Radius (Earth Radii)') # x and y axis labels
plt.ylabel('Exoplanet Density (g/cm**3)')
plt.grid()  # show grid
plt.show()

Interestingly enough, the exoplanet density decreases as radius increases. This could be due to the fact that many exoplanets with a large radius are likely to be gaseous planets, which are much less dense than terrestral planets.

We will also analyze the relationship between each exoplanet's orbital semi-major axis and the effective temperature. The orbital semi-major axis represents the distance from a given exoplanet to its respective sun in astronomical units. An astronomical unit is the mean distance from the center of earth to the center of the sun. The effective temperature represents the temperature of an exoplanet's sun in units of Kelvin.

In [7]:
# configure plot size, x and y limits
plt.figure(figsize = (14,10))
plt.xlim(0, 8)
plt.ylim(3000,8000)
# create horizontal line which represents mean effective temperature
plt.axhline(y = exoplanet_df['effective_temperature'].mean(), color='r', linestyle='--')
# create scatterplot using seaborn with orbital semi-major axis on x axis and temperature on y axis
sns.scatterplot(x = exoplanet_df['orbit_semi_major_axis'], y = exoplanet_df['effective_temperature'])
plt.grid()
plt.xlabel('Orbital Semi-Major Axis (AU)')
plt.ylabel('Effective Temperature (Kelvin)')
plt.show()

It appears that the effective temperature of an exoplanet's sun converges to the mean effective temperature as the orbital semi-major axis increases.

Analysis, Hypothesis Testing & Machine Learning

After visualizing our dataframe, we will now begin to determine which exoplanets are most likely to be habitable to life forms.

Below, we define three functions which are used to determine if a given exoplanet lies within a range of stellar radius and effective temperature. The stellar radius is defined as the distance from the center of a star to the surface, measured in units of radius of the (earth's) sun. The effective temperature is defined as the temperature of the star in Kelvin. Essentially we are determining whether each exoplanet is within the habitable zone for the exoplanet's sun based on this formula.

In [8]:
# function to determine if a exoplanet lies within the inner radius of a habitable zone
def radInner(i):
    return 0.7*exoplanet_df['stellar_radius'].iloc[i]*pow(exoplanet_df['effective_temperature'].iloc[i] / 5777,2)
# function to determine if an exoplanet lies within the outter radius of a habitable zone
def radOuter(i):
    return 1.5*exoplanet_df['stellar_radius'].iloc[i]*pow(exoplanet_df['effective_temperature'].iloc[i] / 5777,2)
# function to determine if a planet is habitable by checking if it lies above the inner radius and below the
# outer radius
def habitable(i):
    if exoplanet_df['orbit_semi_major_axis'].iloc[i] >= radInner(i) and exoplanet_df['orbit_semi_major_axis'].iloc[i] <= radOuter(i):
        return True
    else:
        return False

After defining the habitable functions above, we then want to determine which exoplanets fit the defined criteria. We do this by iterating over the exoplanet dataframe, checking if the current exoplanet is habitable by use of our functions defined above, and if so then adding that exoplanet to a new dataframe, habitableZone. This habitableZone dataframe will be used for further ananlysis only concerning exoplanets that are deemed to potentially be habitable.

In [9]:
# create habitable zone dataframe which will be used to store exoplanets that have potential to be habitable
habitableZone_df = exoplanet_df[0:0]

# iterate over existing exoplanet dataframe, checking if each exoplanet is in the range of values for determining
# habitability by use of the functions defined above
for i in range(len(exoplanet_df)):
    if habitable(i):
        habitableZone_df = habitableZone_df.append(exoplanet_df.iloc[i])
        
# reset index is used to readjust index values to be consecutive 
habitableZone_df = habitableZone_df.reset_index()
habitableZone_df = habitableZone_df.drop(columns='index')
habitableZone_df.head()
Out[9]:
planet_name orbital_period orbit_semi_major_axis eccentricity inclination planet_mass planet_radius planet_density distance optical_magnitude effective_temperature stellar_mass stellar_radius date_of_last_update discovery_facility
0 16 Cyg B b 798.50 1.660 0.680 NaN 565.73740 NaN NaN 21.15 6.25 5750.0 1.08 1.13 2018-09-06 Multiple Observatories
1 55 Cnc f 262.00 0.788 0.305 90.0 44.81403 NaN NaN 12.59 5.96 5196.0 0.91 0.94 2014-10-29 Multiple Observatories
2 BD+14 4559 b 268.94 0.780 0.290 NaN 330.54320 NaN NaN 49.42 9.63 4864.0 0.49 0.86 2018-09-06 McDonald Observatory
3 BD+49 828 b 2590.00 4.200 0.350 NaN 508.52800 NaN NaN 447.59 9.38 4943.0 1.52 7.60 2015-02-05 McDonald Observatory
4 BD-08 2823 c 237.60 0.680 0.190 NaN 104.88390 NaN NaN 41.38 9.88 4816.0 0.50 0.71 2014-05-14 La Silla Observatory

Below we calculate the number of confirmed exoplanets and the number of potentially habitable exoplanets and plot this information using a lollipop plot.

In [10]:
# creates a dataframe containg the count of confirmed planets and potentially habitable exoplanets
count_df = pd.DataFrame({'status': ['Potentially Habitable (124)', 'Confirmed (4104)'], 'count': [len(habitableZone_df),len(exoplanet_df)]})

# creates a horizontal lollipop plot of confirmed planets and potentially habitable exoplanets
plt.figure(figsize=(12,1.5))
plt.ylim(-1, 2)
plt.hlines(y=count_df['status'], xmin=0, xmax=count_df['count'], color='skyblue')
plt.plot(count_df['count'], count_df['status'], "o")
 
# add titles and axis names
plt.title("Potentially Habitable Exoplanets Relative to Confirmed Exoplanets", loc='center', pad=30)
plt.xlabel('Number of Exoplanets')
plt.show()

The current number of confirmed exoplanets that have been discovered is 4104. Finding habitable planets from this set is very lucrative. The first step of finding a habitable exoplanet is to check if the exoplanet falls within it’s systems habitable zone. This calculation drastically reduces the number of exoplanets to 124 potentially habitable ones.

In order to better visualize how exoplanet's and their habitable zones relate, we need to normalize. As seen below, we calculate this by finding the range of the habitable zone for each planet and setting the lower bound to 0 and upper bound to 1. We then proceed to calculate them exoplanets relative location to these bounds. Exoplanets between the values of 0 and 1 are habitable. Exoplanets below the value of 0 are too hot to be habitable and above 1 are too cold to be habitable.

In [11]:
zone_location = []

# iterate over the exoplanet dataframe
for i in range(len(exoplanet_df)):
    # sets the planets location relative to the bounds of its habitable zone
    if habitable(i):
        # exoplanet lies between habitable zone bounds (habitable)
        zone = radOuter(i) - radInner(i)
        zone_location.append(exoplanet_df['orbit_semi_major_axis'].iloc[i] / zone)
    else:
        if exoplanet_df['orbit_semi_major_axis'].iloc[i] < radInner(i):
            # exoplanet lies below the inner radius (unhabitable)
            outer_diff = abs(exoplanet_df['orbit_semi_major_axis'].iloc[i] - radInner(i))
            zone_location.append(exoplanet_df['orbit_semi_major_axis'].iloc[i] - outer_diff)
        else:
            # exoplanet lies above the outer radius (unhabitable)
            outer_diff = abs(exoplanet_df['orbit_semi_major_axis'].iloc[i] - radOuter(i))
            zone_location.append(radOuter(i) + outer_diff)

# adds the zone location of each exoplanet to the exoplanet dataframe
exoplanet_df['zone_location'] = zone_location

# obtain x, y, z
y = exoplanet_df['effective_temperature']
x = exoplanet_df['zone_location']
# radius will control the size of points
z = exoplanet_df['planet_radius']
 
# Change color with c and alpha. I map the color to the X axis value.
plt.figure(figsize=(20,10))
plt.scatter(x, y, s=z*80, c=x, cmap="YlOrRd", alpha=0.8, edgecolors="grey", linewidth=2, vmin=-1, vmax=2)
plt.ylim(2000, 7000)
plt.xlim(-1, 2)

# Add titles (main and on axis)
plt.xlabel("Location Of Planets Relative To Their Habitable Zone (Bounded 0 to 1)", fontsize = 12)
plt.ylabel("Effective Temperature (K)", fontsize = 12)
plt.title("Temperature Of Host Star vs. Location Of Planet Relative To Their Habitable Zone", fontsize = 14)
plt.show()

From the scatter plot above, we can see a large number of the exoplanets in our data set rest between their host star and inner boundary of their system's habitable zone. This means that a large swath of them are too hot to be habitable. We also see an interesting trend in the same part of the plot. The host star's temperature decreases as the location of the exoplanet approaches the inner bound of their habitable zone.

Here we are looking at the number of habitable exoplanets discovered by each discovery facility.

In [12]:
# facilities dictionary stores each discovery facility as keys and how many exoplanet discoveries as values
facilities = {}
# iterate over habitable zones dataframe, either adding a new entry or update existing entry in dictionary
for index, row in habitableZone_df.iterrows():
    if facilities.get(row['discovery_facility']) == None:
        facilities[row['discovery_facility']] = 1
    else:
        facilities[row['discovery_facility']] += 1

# use seaborn to plot number of habitable exoplanets discovered for each discovery facility
plt.figure(figsize = (14,10))
plt.title('Number of Potentially Habitable Exoplanets for Each Discovery Facility', fontsize = 14)
plt.xlabel('Discovery Facility', fontsize = 12)
plt.ylabel('Number of Potentially Habitable Exoplanet Discoveries', fontsize = 12)
plt.xticks(range(len(list(facilities.keys()))), rotation = -30, ha = 'left')
sns.barplot(x = list(facilities.keys()), y = list(facilities.values()))
plt.show()

Kepler, La Silla, and W.M. Keck discovery facilities are leading the way in potentially habitable exoplanet discoveries, with over 20 discovered exoplanets each.

Stellar Classification

We will now group each exoplanet's star based on the star's effective temperature. We will use the Morgan-Keenan System to designate a star's spectral type. We will then use these spectral types to determine which types are found among those exoplanets deemed to be potentially habitable.

In [13]:
# spectral color and class store the respective color and class of each exoplanet based on temperature
spectral_class = []
spectral_color = []

# iterate over potentially habitable exoplanets, assigning spectral classes and colors for each planet
for i in range(len(exoplanet_df)):
    temp = exoplanet_df['effective_temperature'].iloc[i]
    if temp > 30000:
        spectral_class.append("O")
        spectral_color.append("Blue")
    elif temp > 10000 and temp <= 30000:
        spectral_class.append("B")
        spectral_color.append("Blue White")
    elif temp > 7500 and temp <= 10000:
        spectral_class.append("A")
        spectral_color.append("White")
    elif temp > 6000 and temp <= 7500:
        spectral_class.append("F")
        spectral_color.append("Yellow White")
    elif temp > 5200 and temp <= 6000:
        spectral_class.append("G")
        spectral_color.append("Yellow")
    elif temp > 3700 and temp <= 5200:
        spectral_class.append("K")
        spectral_color.append("Orange")
    elif temp > 2400 and temp <= 3700:
        spectral_class.append("M")
        spectral_color.append("Red")
    else:
        spectral_class.append(np.nan)
        spectral_color.append(np.nan)
        
# create new columns in habitable zones dataframe to store the class and color of each exoplanet
exoplanet_df['spectral_class'] = spectral_class
exoplanet_df['spectral_color'] = spectral_color

# recreate habitable zone dataframe
habitableZone_df = exoplanet_df[0:0]
for i in range(len(exoplanet_df)):
    if habitable(i):
        habitableZone_df = habitableZone_df.append(exoplanet_df.iloc[i])
habitableZone_df = habitableZone_df.reset_index()
habitableZone_df = habitableZone_df.drop(columns='index')

# create an empty dictionary that will store the stellar types for each potentially habitable exoplanet
spec = {}

for index, row in habitableZone_df.iterrows():
    if spec.get(row['spectral_class']) == None:  # if type is not defined in dictionary
        spec[row['spectral_class']] = 1
    else:
        spec[row['spectral_class']] += 1

plt.figure(figsize = (10,7))
plt.title('Spectral Classification of Host Star of Potentially Habitable Exoplanets', fontsize = 14, pad=30)
plt.xlabel('Host Star Spectral Classification', fontsize = 12)
plt.ylabel('Number of Potentially Habitable Exoplanets', fontsize = 12)
plt.stem(list(spec.keys()), list(spec.values()),use_line_collection=True)
plt.show()

Based on the chart, it is apparent that most of the potentially habitable exoplanets lie within the 'G' class of spectral classification. This is promising considering that Earth is within the 'G' class of spectral classification, suggesting that most of the potentially habitable exoplanets share the same spectral classification with Earth.

Exoplanet Size Classification

We will now classify the exoplanets into groups based on mass. To do this, we iterate over the habitable zones dataframe, extract the planetary mass relative to Earth, and decide which class each exoplanet belongs to. The bounds used to determine which class an exoplanet belongs to were obtained from the Planetary Habitability Laboratory.

In [14]:
# size class array determines class of exoplanet based on mass
size_class = []
size_class_num = []

# iterate over habitable zones dataframe, determining the class of each planet based on mass
for i in range(len(exoplanet_df)):
    mass = exoplanet_df['planet_mass'].iloc[i]
    if mass > 0 and mass < 0.00001:
        size_class.append("Asterodian")
        size_class_num.append(0)
    elif mass >= 0.00001 and mass < 0.1:
        size_class.append("Mercurian")
        size_class_num.append(0)
    elif  mass >= 0.1 and mass < 0.5:
        size_class.append("Subterran")
        size_class_num.append(1)
    elif mass >= 0.5 and mass < 2:
        size_class.append("Terran")
        size_class_num.append(2)
    elif  mass >= 2 and mass < 10:
        size_class.append("Superterran")
        size_class_num.append(3)
    elif  mass >= 10 and mass < 50:
        size_class.append("Neptunian")
        size_class_num.append(4)
    elif  mass >= 50 and mass < 5000:
        size_class.append("Jovian")
        size_class_num.append(0)
    else:
        size_class.append(np.nan)
        size_class_num.append(0)
        
# append size class array to habitable zones dataframe
exoplanet_df['size_class'] = size_class
exoplanet_df['size_class_num'] = size_class_num

# recreate habitable zone dataframe
habitableZone_df = exoplanet_df[0:0]
for i in range(len(exoplanet_df)):
    if habitable(i):
        habitableZone_df = habitableZone_df.append(exoplanet_df.iloc[i])
habitableZone_df = habitableZone_df.reset_index()
habitableZone_df = habitableZone_df.drop(columns='index')

# create size dictionary to hold the count of each class of planets based on mass
size = {}

# sum up the number of exoplanets that do no have a reported mass thus do not belong to a class 
size['Unclassified'] = habitableZone_df['size_class'].isnull().sum()

# iterate over each exoplanet, adding to the quantity of planets for each class
for index, row in habitableZone_df.iterrows():
    if size.get(row['size_class']) == None:
        size[row['size_class']] = 1
    else:
        size[row['size_class']] += 1

plt.figure(figsize = (10,7))
plt.title('Size Classification of Potentially Habitable Exoplanets', fontsize = 14)
plt.xlabel('Number of Potentially Habitable Exoplanets', fontsize = 12)
plt.ylabel('Size Classification', fontsize = 12)
plt.yticks(range(len(list(size.keys()))))
sns.barplot(y = list(size.keys()), x = list(size.values()))
plt.show()

Based on the bar chart, the majority of potentially habitable exoplanets are in the Jovian size class. Jovian planets are huge compared to Earth, being somewhere between 50 to 5000 times the mass of Earth.

Correlation

We are now going to attempt to find a relationship between the size class of the planet and one of our numerical variables. To find a relationship we are first going to explore some box plots to see if further analysis is needed. Correlation shows us how related two variables are, you can read more about it here. In this analysis we are only considering terran like planets since these are the only ones that are habitable.

In [15]:
# Making a subset dataframe with just planet name, the size class, and optical magnitude.
adjusted_df = exoplanet_df[['planet_name', 'size_class', 'optical_magnitude', 'eccentricity', 'size_class_num']]


# drops jovian and neptunian size planets because they are currently known to not be able to host life
adjusted_df = adjusted_df.drop(adjusted_df.index[adjusted_df.size_class == 'Jovian'])
adjusted_df = adjusted_df.drop(adjusted_df.index[adjusted_df.size_class == 'Neptunian'])

# Dropping values that have NaN. There is no good way to impute data since the planets variation is so large.
adjusted_df = adjusted_df.dropna()
adjusted_df = adjusted_df.reindex()

# Creates box plot with size class on the x-axis and the optical magnitude on the y-axis.
plt.figure(figsize=(14,8))
ax = sns.boxplot(x=adjusted_df.size_class, y=adjusted_df.optical_magnitude, palette="Set3")
plt.show()

From this box plot them seems to be a significant relationship between the numerical variable optical magnitude and the planet class size. The mean values between the different categories differ enough to further analyze if there is a significant correlation between the optical magnitude and planet class size.

In [16]:
# Creates box plot with size class on the x-axis and the optical magnitude on the y-axis.
plt.figure(figsize=(14,8))
ax = sns.boxplot(x=adjusted_df.size_class, y=adjusted_df.eccentricity, palette="Set3")
plt.show()

The box plot above shows the relationship between the class size and the numerical variable eccentricity. The Superterran planet class has a significant number of outliers indicted by the dots above the whisker. Outliers tend to skew the data and make the results of many analysis invalid. As you can see the box plot filtered them out. Looking at the means of the eccentricity between the categories they are very closes to each other and differ by a small amount. This indicates that eccentricity is independent from the class size. Based on this fact and the amount of outliers Superterran has we should not further analyze this relationship. These are the only two numerical variables we will make a box plot for since there are significant outliers for the rest of the variables which make the boxes unviewable and would lead to an unreliable statistical analysis.

Now we are going to further analyze the relationship between planet class size and optical magnitude. Below we are going to use a SciPy library to output some information that is needed for this analysis.

In [17]:
r = pearsonr(adjusted_df.size_class_num.values, adjusted_df.optical_magnitude.values)
print("Correlation Coefficient:", r[0], "P-value:", r[1])
Correlation Coefficient: -0.26740774362692227 P-value: 0.00037555037732321433

The correlation coefficient is approximately %26.7 which indicates a weak positive correlation between the optical magnitude and planet class size. The p-value is approximately 0.00 which means our result is statistically significant, meaning that there is evidence based on this sample that the correlation between optical magnitude and planet class size is non-zero. Based on the fact that out subterran sample is smaller then the rest I believe that we would need more subterran samples in order to get a higher and more reliable correlation coefficient. Out of all the numerical variables optical magnitude is the only variables that shows any significant correlation between the planet class. This is good information for when looking at habital planets.

Machine Learning

Now are going to do some machine learning for some insight/analysis into the data. We are first going to do a principal component analysis in order to visualize the data better since we are dealing with a lot of dependent variables. Then we are going to have a metric/score of how similar a planet is to Earth on the variables we are considering based on cosine similarity. We are then going to use linear discriminant analysis to classify planets.

Below we are going to do some preliminary things to prepare for our analysis. First, we put in Earth's data for the dependent variables. The dependent variables we are considering are the orbital period, orbit semi major axis, eccentricity, planet mass, effective temperature, and optical magnitude. These are a lot of variables but needed for classifying for similarity to Earth. We then make a new data frame for every row that has all these values which are about 1000 exoplanets. We then finally calculate the cosine similarity between the exoplanet and Earth to get a Earth similarity score. Please note this score is based solely on the variables we are using and only show how close the exoplanets are with the features we are considering. We then append the cosine similarity score to the data frame. We also scale our data for the next few algorithms.

In [18]:
# This is Earth's info for features we are going to use to compare with some of the exoplanets.
earth_mass = 1.0
earth_effective_temp = 5777.0
earth_eccentricity = 0.0167
earth_semi_major_axis = 1.0
earth_orbital_period = 365

earth_data = [[earth_orbital_period, earth_semi_major_axis, earth_eccentricity, earth_mass, earth_effective_temp, 1.0]]

# Create a dataframe of the features we will be using for comparing similarity.
feature_df = exoplanet_df[['orbital_period', 'orbit_semi_major_axis', 'eccentricity', 'planet_mass',
                           'effective_temperature', 'optical_magnitude']]

# Removes rows with NaN.
feature_df = feature_df.dropna()
similarity_list = []

# Iterates over the feature dataframe and calculates the cosine similarity between the features of Earth and the
# current exoplanet.
for index, row in feature_df.iterrows():
    row_features = [[row['orbital_period'], row['orbit_semi_major_axis'], row['eccentricity'], row['planet_mass'], 
                    row['effective_temperature'], row['optical_magnitude']]]
    similarity_score = cosine_similarity(row_features, earth_data)[0][0]
    similarity_list.append(similarity_score)
    
feature_df.reindex
# Adds new column with the calculated Earth similarity score.
feature_df['earth_similarity_score'] = similarity_list
# Scaling the data.
data_scaled = pd.DataFrame(preprocessing.scale(feature_df),columns = feature_df.columns) 
feature_df.head(10)
Out[18]:
orbital_period orbit_semi_major_axis eccentricity planet_mass effective_temperature optical_magnitude earth_similarity_score
0 326.03000 1.290 0.231 6165.9020 4742.0 4.740 0.610660
1 516.21997 1.530 0.080 4684.8142 4213.0 5.016 0.670390
2 185.84000 0.830 0.000 1525.5840 4813.0 5.227 0.953089
3 1773.40002 2.930 0.370 1481.0878 5338.0 6.610 0.935161
4 798.50000 1.660 0.680 565.7374 5750.0 6.250 0.992512
5 993.30000 2.600 0.080 3273.6490 4979.0 5.506 0.833020
7 30.35060 0.190 0.042 289.2253 4893.0 5.580 0.996652
8 452.80000 1.333 0.090 632.4817 5098.0 6.441 0.992149
9 883.00000 2.080 0.290 273.3338 5098.0 6.441 0.992754
17 335.10001 0.990 0.290 4392.4106 6331.0 6.962 0.822049

We are now going to do a principal component analysis on our data (PCA). What this does is reduces our data to its principal components meaning we are only considering the essential parts and removing anything that is unnesccary. This is important to visualize since we are dealing with a good amount of dependent variables. This allows us to measure our data with the principal components rather then just the x and y axis. For more information on PCA refer to this link For the PCA API follow this link. The plot will show us a lower dimension pictured when viewed from the most relevant angle. Please note we are classifying Earth 'likeness' based on a few features.

In [19]:
X = []

# Extracting the features needed for our analysis.
for index, row in feature_df.iterrows():
    X.append([row['orbital_period'], row['orbit_semi_major_axis'], row['eccentricity'], row['planet_mass'], 
              row['effective_temperature'], row['optical_magnitude']])     

# Creating our PCA object.
pca = PCA()
X_pca = pca.fit_transform(data_scaled)
# Plotting our PCA to show the most essential feature.
plt.figure(figsize=(14,8))
plt.scatter(
    X_pca[:,0],
    X_pca[:,1],
    c=X_pca[:,0],
    cmap='rainbow',
    alpha=0.7,
    edgecolors='b'
)
plt.show()

As we can see above all the data is very concentrated towards the left middle part of the graph. Lets zoom in their to get a better picture of what is going on.

In [20]:
# Plotting the same plot except setting limits to the x and y axis
plt.figure(figsize=(14,8))
plt.xlim([-2.5, 5.0])
plt.ylim([-3, 2])
plt.scatter(
    X_pca[:,0],
    X_pca[:,1],
    c=X_pca[:,0],
    cmap='rainbow',
    alpha=0.7,
    edgecolors='b'
)
plt.show()

As we can see here most of the concentration of data is converging toward the left side of the graph, almost to a single point it seems. We can this is the most essential principal component of our data and must be kept preserving the variance in the data. As we move further to the right and see the dots change color, we can see the data is much more scattered and these would represent the principal components that are not very important to us. The PCA library does not have a way to output which components are the most essential very nicely but the ones we will keep for linear discriminant analysis classifier are the planet mass and orbit semi major access which made up most of the variance in our PCA analysis.

Now we are going to do a linear discriminant analysis (LDA). This will be used to classify planets based on our given features. This will also do another dimension reduction like our PCA analysis. First thing we are doing in the code is assigning a classification score based on the similarity score. This is needed for LDA. LDA is useful in that we started with a whole bunch of dependent variables which make logistic and linear regression not a suitable prediction method. You can read more about the LDA API here. A more digestible place to start if you are unfamiliar with LDA is here. We are aiming to divide the differences in groups with this model.

In [21]:
X = []
adjusted_y = []

# Grabbing the two essential features from our PCA analysis.
for index, row in feature_df.iterrows():
    X.append([row['orbit_semi_major_axis'], row['planet_mass']]) 

# Assigning a classification score for our planets. Must need a .999 or above score to be very similar to Earth,
# .98 or above to be somewhat similar and anything else not similar.
for v in similarity_list:
    if v > .999:
        adjusted_y.append(1)
    elif v > .98:
        adjusted_y.append(2)
    else:
        adjusted_y.append(3)

# Creating our LDA object and fitting it.
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit(X, adjusted_y).transform(X)

# Plotting our results.
plt.figure(figsize=(14,8))
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.scatter(
    X_lda[:,0],
    X_lda[:,1],
    c=X_lda[:,0],
    cmap='rainbow',
    alpha=0.7,
    edgecolors='b'
)
plt.show()

As you can see in the above graph you can see where the differences occur between our three classes occur. Let’s zoom in on the cluster of points to the left to get a better picture.

In [22]:
# Same plot as above except with a new axis scale to zoom in on the data.
plt.figure(figsize=(14,8))
plt.xlim([-1.1, 2])
plt.ylim([-2.5, 2])
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.scatter(
    X_lda[:,0],
    X_lda[:,1],
    c=X_lda[:,0],
    cmap='rainbow',
    alpha=0.7,
    edgecolors='b'
)
plt.show()

Now we can see the boundary a little bit better. So now with this model we can input some generic planet and see where it lies on the boundary to see if it is similar enough to Earth and if it meets the habitual properties. With PCA we were able to reduce the number of features needed and then with LDA we used only those essential features to set boundaries on our classifications.

Conclusion

After thorough analysis of the exoplanet data, it can be seen that there are many exoplanets discovered that share similar characteristics to Earth. These characteristics include mass, radius, proximity to the sun, and effective temperature. One of the major challenges in determining if there is actually life on these exoplanets is the fact that we are examining them from several hundred parsecs away. A single parsec is 3.26 light years, and we currently do not have the means to intimately observe from such a distance.

Nonetheless, it is important that we continue to observe and discover exoplanets, so that when the time comes when we are able to get a better look, we will know which exoplanets to look at first.

References