{ "metadata": { }, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
In this lesson, we will use Python 3 with some of its most popular scientific libraries. This tutorial assumes that you are familiar with the fundamentals of the Python programming language and with running Python programs in Galaxy. If you are not, we advise following the “Introduction to Python” tutorial available on the same platform first. We will be using Jupyter Notebook, an interactive Python environment that comes with everything we need for the lesson. Please note: Jupyter Notebook is currently available only on the usegalaxy.eu and usegalaxy.org sites.
\n\n\nComment\nThis tutorial is largely based on the Carpentries lessons Programming with Python and Plotting and Programming in Python, which are licensed CC-BY 4.0.
\nAdaptations have been made to make this work better in a GTN/Galaxy environment.
\n
\n\n
NumPy is a Python library whose name stands for Numerical Python. In general, you should use this library when you want to manipulate and perform operations on numerical data, especially matrices or arrays. To tell Python that we’d like to start using NumPy, we need to import it:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "import numpy as np" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-2", "source": "A Numpy array contains one or more elements of the same type. To examine the basic functions of the library, we will create an array of random data. These data will correspond to arthritis patients’ inflammation. The rows are the individual patients, and the columns are their daily inflammation measurements. We will use the random.randint()
function. It takes four arguments: randint(low, high=None, size=None, dtype=int)
. low
and high
specify the limits of the random number generator (low is included, high is excluded). size
determines the shape of the array and can be an integer or a tuple.
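The cell that creates the array is not shown here, so the following is only a minimal sketch of one way to do it. The limits 10 and 25 and the seed are assumptions chosen for illustration; the 50×70 shape (50 patients, 70 days) matches the shape reported later in this tutorial.

```python
import numpy as np

# Assumption: the limits (10, 25) and the seed are illustrative choices.
np.random.seed(0)  # makes the example reproducible

# 50 rows (patients) and 70 columns (days); low is included, high is excluded
random_data = np.random.randint(10, 25, size=(50, 70))

print(random_data.shape)
print(random_data.min() >= 10, random_data.max() < 25)
```

Passing a tuple to size gives a two-dimensional array; a single integer would give a one-dimensional array of that length.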
If we want to check that the data have been created, we can print the variable’s value:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "print(random_data)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-6", "source": "Now that the data are in memory, we can manipulate them. First, let’s ask what type of thing data refers to:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "print(type(random_data))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-8", "source": "The output tells us that data currently refers to an N-dimensional array, the functionality for which is provided by the NumPy library. These data correspond to arthritis patients’ inflammation. The rows are the individual patients, and the columns are their daily inflammation measurements.
\nThe type
function will only tell you that a variable is a NumPy array but won’t tell you the type of thing inside the array. We can find out the type of the data contained in the NumPy array.
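A minimal sketch of how this could look, assuming the array was created with np.random.randint (the limits below are illustrative): the dtype attribute reports the type of the elements rather than the type of the container.

```python
import numpy as np

# Assumption: illustrative limits, same shape as in the tutorial.
random_data = np.random.randint(10, 25, size=(50, 70))

# type() describes the container, dtype describes the elements inside it
print(type(random_data))
print(random_data.dtype)
```

The exact integer width that is printed can differ between platforms and NumPy versions.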
This tells us that the NumPy array’s elements are integer numbers.
\nWith the following command, we can see the array’s shape:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-11", "source": [ "print(random_data.shape)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-12", "source": "The output tells us that the data array variable contains 50 rows and 70 columns. When we created the variable random_data
to store our arthritis data, we did not only create the array; we also created information about the array, called members or attributes. This extra information describes random_data
in the same way an adjective describes a noun. random_data.shape
is an attribute of random_data
which describes the dimensions of random_data
.
If we want to get a single number from the array, we must provide an index in square brackets after the variable name, just as we do in math when referring to an element of a matrix. Our data has two dimensions, so we will need to use two indices to refer to one specific value:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-13", "source": [ "print('first value in data:', random_data[0, 0])" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-14", "source": "\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-15", "source": [ "print('middle value in data:', random_data[25, 35])" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-16", "source": "The expression random_data[25, 35] accesses the element at row 25, column 35. While this expression may not surprise you, random_data[0, 0] might. Programming languages like Fortran, MATLAB and R start counting at 1 because that’s what human beings have done for thousands of years. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because it represents an offset from the first value in the array (the second value is offset by one index from the first value). As a result, if we have an M×N array in Python, its indices go from 0 to M-1 on the first axis and 0 to N-1 on the second.
\nSlicing data\nAn index like [25, 35] selects a single element of an array, but we can select whole sections as well, using slicing in the same way as with strings. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-17", "source": [ "print(random_data[0:4, 0:10])" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-18", "source": "We don’t have to include the upper and lower bound on the slice. If we don’t include the lower bound, Python uses 0 by default; if we don’t include the upper, the slice runs to the end of the axis, and if we don’t include either (i.e., if we use ‘:’ on its own), the slice includes everything:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-19", "source": [ "small = random_data[:3, 36:]\n", "print('small is:')\n", "print(small)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-20", "source": "The above example selects rows 0 through 2 and columns 36 through to the end of the array.
\nNumPy has several useful functions that take an array as input to perform operations on its values. If we want to find the average inflammation for all patients on all days, for example, we can ask NumPy to compute random_data’s mean value:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-21", "source": [ "print(np.mean(random_data))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-22", "source": "Let’s use three other NumPy functions to get some descriptive values about the dataset. We’ll also use multiple assignment, a convenient Python feature that will enable us to do this all in one line.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-23", "source": [ "maxval, minval, stdval = np.max(random_data), np.min(random_data), np.std(random_data)\n", "\n", "print('maximum inflammation:', maxval)\n", "print('minimum inflammation:', minval)\n", "print('standard deviation:', stdval)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-24", "source": "How did we know what functions NumPy has and how to use them? If you are working in IPython or in a Jupyter Notebook, there is an easy way to find out. If you type the name of something followed by a dot, then you can use tab completion (e.g. type np.
and then press Tab) to see a list of all functions and attributes that you can use. After selecting one, you can also add a question mark (e.g. np.cumprod?
), and IPython will return an explanation of the method! This is the same as doing help(np.cumprod)
.
When analyzing data, though, we often want to look at variations in statistical values, such as the maximum inflammation per patient or the average inflammation per day. One way to do this is to create a new temporary array of the data we want, then ask it to do the calculation:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-25", "source": [ "patient_0 = random_data[0, :] # 0 on the first axis (rows), everything on the second (columns)\n", "print('maximum inflammation for patient 0:', np.max(patient_0))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-26", "source": "What if we need the maximum inflammation for each patient over all days (as in the next diagram on the left) or the average for each day (as in the diagram on the right)? As the diagram below shows, we want to perform the operation across an axis:
\nTo support this functionality, most array functions allow us to specify the axis we want to work on. If we ask for the average across axis 0 (rows in our 2D example), we get:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-27", "source": [ "print(np.mean(random_data, axis=0))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-28", "source": "As a quick check, we can ask this array what its shape is:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-29", "source": [ "print(np.mean(random_data, axis=0).shape)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-30", "source": "The expression (70,) tells us we have a one-dimensional array of 70 elements, so this is the average inflammation per day for all patients. If we average across axis 1 (columns in our 2D example), we get the average inflammation per patient across all days:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-31", "source": [ "print(np.mean(random_data, axis=1))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-32", "source": "Arrays can be concatenated and stacked on top of one another, using NumPy’s vstack
and hstack
functions for vertical and horizontal stacking, respectively.
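As a small sketch of both functions, the example below stacks an array with itself. The 3×3 array A is made up for this illustration and is only an assumption about what such an input might look like.

```python
import numpy as np

# Assumption: A is a small 3x3 example array invented for this sketch.
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

B = np.hstack([A, A])  # side by side: 3 rows, 6 columns
C = np.vstack([A, A])  # on top of each other: 6 rows, 3 columns

print('B has shape', B.shape)
print('C has shape', C.shape)
```

For horizontal stacking the arrays need the same number of rows, and for vertical stacking the same number of columns.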
Sometimes an array contains missing values, which can make it difficult to perform operations on it. To remove the NaN
values, you must first find their indexes and then replace them. The following example replaces them with 0
.
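One possible way to do this, sketched on a made-up array (the values are an assumption for illustration), is to build a boolean mask with np.isnan and assign through it.

```python
import numpy as np

# Assumption: a small made-up array with two missing values.
values = np.array([1.5, np.nan, 3.0, np.nan, 4.5])

nan_positions = np.isnan(values)     # boolean mask, True where the value is NaN
print(np.where(nan_positions)[0])    # indexes of the missing values

values[nan_positions] = 0            # replace every NaN with 0
print(values)
```

Whether 0 is a sensible replacement depends on the analysis; it is used here only because the text above mentions it.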
\n\nQuestion: Selecting and stacking arrays\nWrite some additional code that slices the first and last columns of A, and stacks them into a 3x2 array. Make sure to print the results to verify your solution.
\n\n👁 View solution
\n\nA ‘gotcha’ with array indexing is that singleton dimensions are dropped by default. That means
\nA[:, 0]
is a one-dimensional array, which won’t stack as desired. To preserve singleton dimensions, the index itself can be a slice or array. For example, A[:, :1]
returns a two dimensional array with one singleton dimension (i.e. a column vector).\nD = np.hstack((A[:, :1], A[:, -1:]))\nprint('D = ')\nprint(D)\n
\n\n\nQuestion: Selecting with conditionals\nGiven the following array
\nA
, keep only the elements that are lower than 0.05
.\nA = np.array([0.81, 0.025, 0.15, 0.67, 0.01])\n
👁 View solution
\n\n\nA = A[A<0.05]\n
Pandas (pandas development team 2020, Wes McKinney 2010) is a widely used Python library for statistics, particularly on tabular data. If you are familiar with R dataframes, pandas provides the equivalent functionality in Python. A dataframe is a 2-dimensional table with indexes and column names. The indexes label the rows, while the column names label the columns. You will see later that these two features are useful when you’re manipulating your data. Each column can contain a different data type.
\nLoad it with import pandas as pd
. The alias pd
is commonly used for pandas.
There are many ways to create a pandas dataframe. For example, you can use a NumPy array as input.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-39", "source": [ "data = np.array([['','Col1','Col2'],\n", " ['Row1',1,2],\n", " ['Row2',3,4]])\n", "\n", "print(pd.DataFrame(data=data[1:,1:],\n", " index=data[1:,0],\n", " columns=data[0,1:]))" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-40", "source": "For the purposes of this tutorial, we will use a file with the annotated differentially expressed genes that was produced in the Reference-based RNA-Seq data analysis tutorial
\nWe can read a tabular file with pd.read_csv
. The first argument is the filepath of the file to be read. The sep
argument refers to the symbol used to separate the data into different columns. You can check the rest of the arguments using the help()
function.
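The sketch below illustrates the two arguments on a tiny made-up table held in memory, so that it runs without downloading anything. The table contents and the semicolon separator are assumptions for illustration; for a tab-separated file such as the one used in this tutorial, sep would be the tab character instead.

```python
import io
import pandas as pd

# Assumption: a tiny made-up table stands in for the real file;
# its values are not taken from the tutorial data.
text = '''GeneID;Base mean;Strand
geneA;10.5;+
geneB;20.0;-
geneC;7.25;+
'''

# The first argument can be a file path, a URL, or an open file-like object.
# sep tells pandas which symbol separates the columns.
toy = pd.read_csv(io.StringIO(text), sep=';')
print(toy)
```

With the wrong separator, pandas would read each line as a single column, so checking the shape of the result is a quick sanity test.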
The columns in a dataframe are the observed variables, and the rows are the observations. Pandas uses backslash \\
to show wrapped lines when output is too wide to fit the screen.
You can use index_col
to specify that a column’s values should be used as row headings.
By default, row indexes are numbers, but we can use a column of the data instead by passing its name to read_csv
through the index_col
parameter. Be careful though, because the row indexes should be unique for each row.
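To make the difference concrete, here is a minimal sketch that reads the same made-up table twice, once with the default numeric index and once with index_col. The table contents are an assumption for illustration only.

```python
import io
import pandas as pd

# Assumption: a made-up two-row table; only the column names echo the tutorial.
text = '''GeneID,Base mean,Strand
geneA,10.5,+
geneB,20.0,-
'''

default_index = pd.read_csv(io.StringIO(text))
labelled = pd.read_csv(io.StringIO(text), index_col='GeneID')

print(list(default_index.index))  # row positions used as the index
print(list(labelled.index))       # labels taken from the GeneID column
print(labelled.index.is_unique)   # quick check that every label occurs once
```

The is_unique attribute of the index is a convenient way to verify the uniqueness mentioned above.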
You can use the DataFrame.info()
method to find out more about a dataframe.
We learn that this is a DataFrame. It consists of 130 rows and 12 columns. None of the columns contains any missing values. 6 columns contain 64-bit floating point float64
values, 2 contain 64-bit integer int64
values and 4 contain character object
values. It uses 13.2KB of memory.
The DataFrame.columns
variable stores information about the dataframe’s columns.
Note that this is an attribute, not a method (it doesn’t have parentheses). Such an attribute is called a member variable, or just a member.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-47", "source": [ "print(data.columns)" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-48", "source": "You could use DataFrame.T
to transpose a dataframe. The Transpose
(written .T
) doesn’t copy the data, just changes the program’s view of it. Like columns, it is a member variable.
You can use DataFrame.describe()
to get summary statistics about the data. DataFrame.describe()
returns the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'
. Depending on the data type of each column, the statistics that can’t be calculated are replaced with the value NaN
.
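A small sketch of both behaviours, using a made-up dataframe with one numerical and one text column (the values are assumptions for illustration):

```python
import pandas as pd

# Assumption: made-up values; the column names are borrowed from the tutorial table.
toy = pd.DataFrame({'Base mean': [10.0, 20.0, 30.0],
                    'Strand': ['+', '-', '+']},
                   index=['geneA', 'geneB', 'geneC'])

numeric_only = toy.describe()             # summarises only the numerical column
everything = toy.describe(include='all')  # text column included, gaps shown as NaN

print(numeric_only)
print(everything)
```

In the second table, statistics such as the mean cannot be computed for the text column, so those cells hold NaN.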
\n\nQuestion: Using pd.head and pd.tail\nAfter reading the data, use
\nhelp(data.head)
and help(data.tail)
to find out what DataFrame.head
and DataFrame.tail
do.\n\ta. What method call will display the first three rows of the data?\n\tb. What method call will display the last three columns of this data? (Hint: you may need to change your view of the data.)\n👁 View solution
\n\na. We can check out the first five rows of the data by executing
\ndata.head()
(allowing us to view the head of the DataFrame). We can specify the number of rows we wish to see by specifying the parameter n
in our call to data.head()
. To view the first three rows, execute:\ndata.head(n=3)\n
\n\n
\n\n| GeneID | Base mean | log2(FC) | StdErr | Wald-Stats | P-value | P-adj | Chromosome | Start | End | Strand | Feature | Gene name |\n| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |\n| FBgn0039155 | 1086.974295 | -4.148450 | 0.134949 | -30.740913 | 1.617357e-207 | 1.387207e-203 | chr3R | 24141394 | 24147490 | + | protein_coding | Kal1 |\n| FBgn0003360 | 6409.577128 | -2.999777 | 0.104345 | -28.748637 | 9.419922e-182 | 4.039734e-178 | chrX | 10780892 | 10786958 | - | protein_coding | sesB |\n| FBgn0026562 | 65114.840564 | -2.380164 | 0.084327 | -28.225437 | 2.850430e-175 | 8.149380e-172 | chr3R | 26869237 | 26871995 | - | protein_coding | BM-40-SPARC |\n\nb. To check out the last three rows, we would use the command,
\ndata.tail(n=3)
, analogous to head()
used above. However, here we want to look at the last three columns, so we need to change our view and then use tail()
. To do so, we create a new DataFrame in which rows and columns are switched:\ndata_flipped = data.T\n
We can then view the last three columns of the data by viewing the last three rows of data_flipped:
\n\ndata_flipped.tail(n=3)\n
| GeneID | FBgn0039155 | FBgn0003360 | FBgn0026562 | FBgn0025111 | FBgn0029167 | FBgn0039827 | FBgn0035085 | FBgn0034736 | FBgn0264475 | FBgn0000071 | … | FBgn0264343 | FBgn0038237 | FBgn0020376 | FBgn0028939 | FBgn0036560 | FBgn0035710 | FBgn0035523 | FBgn0038261 | FBgn0039178 | FBgn0034636 |\n| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |\n| Strand | + | - | - | - | + | + | + | + | + | + | … | + | - | + | + | + | - | + | + | + | - |\n| Feature | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | lincRNA | protein_coding | … | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding | protein_coding |\n| Gene name | Kal1 | sesB | BM-40-SPARC | Ant2 | Hml | CG1544 | CG3770 | CG6018 | CR43883 | Ama | … | CG43799 | Pde6 | Sr-CIII | NimC2 | CG5895 | SP1173 | CG1311 | CG14856 | CG6356 | CG10440 |
\n
\n\n\nQuestion: Saving in a csv file\nAs well as the
\nread_csv
function for reading data from a file, Pandas provides a to_csv
method to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called processed.csv
. You can use help
to get information on how to use to_csv
.\n👁 View solution
\n\n\ndata_flipped.to_csv('processed.csv')\n
A DataFrame is a collection of Series: the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.
\nPandas is built on top of the NumPy library, which in practice means that most of the methods defined for NumPy arrays apply to Pandas Series and DataFrames.
\nWhat makes Pandas so attractive is its powerful interface for accessing individual records of the table, its proper handling of missing values, and its support for relational-database operations between DataFrames.
\nTo access a value at the position [i,j]
of a DataFrame, we have two options, depending on what i and j mean. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.
You can use DataFrame.iloc[..., ...]
to select values by their (entry) position, that is, by numerical index, analogous to a 2D version of character selection in strings.
You can also use DataFrame.loc[..., ...]
to select values by their (entry) label, that is, by row and column name, analogous to a 2D version of dictionary keys.
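The two access styles can be compared on a small made-up dataframe. The row labels and values below are assumptions for illustration; with the tutorial data, the labels would be GeneID values.

```python
import pandas as pd

# Assumption: made-up rows and labels, used only to contrast iloc and loc.
toy = pd.DataFrame({'Base mean': [10.0, 20.0, 30.0],
                    'Strand': ['+', '-', '+']},
                   index=['geneA', 'geneB', 'geneC'])

by_position = toy.iloc[1, 0]              # second row, first column
by_label = toy.loc['geneB', 'Base mean']  # same cell, addressed by labels
print(by_position, by_label)

# A lone ':' selects everything along that axis, here all columns of one row.
print(toy.loc['geneB', :])
```

Both expressions reach the same cell; which one is more convenient depends on whether you know the position or the label.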
You can use Python’s usual slicing notation to select all or a subset of rows and/or columns. For example, data.loc[\"FBgn0039155\", :] selects all the columns of the row \"FBgn0039155\"
.
This gives the same result as data.loc[\"FBgn0039155\"]
(without a second index).
You can select multiple columns or rows using DataFrame.loc
and a slice of labels, or DataFrame.iloc
and the numbers corresponding to the rows and columns.
When choosing or transitioning between loc
and iloc
, you should keep in mind that the two indexers use slightly different slicing schemes.
iloc
uses the standard Python indexing scheme, where the first element of the range is included and the last one excluded. So 0:10
will select entries 0,...,9
. loc
, meanwhile, indexes inclusively. So 0:10
will select entries 0,...,10
.
This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000
. In this case df.iloc[0:1000]
will return 1000 entries, while df.loc[0:1000]
returns 1001 of them! To get 1000 elements using loc
, you will need to go one lower and ask for df.loc[0:999]
.
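The counts quoted above can be checked with a short sketch. The dataframe below is made up for this purpose; the only assumption is that its index is the plain list 0, ..., 1000.

```python
import pandas as pd

# Assumption: a made-up frame whose index is simply 0, 1, ..., 1000.
df = pd.DataFrame({'value': range(1001)})

print(len(df.iloc[0:1000]))  # positions 0..999, end excluded
print(len(df.loc[0:1000]))   # labels 0..1000, end included
print(len(df.loc[0:999]))    # labels 0..999, end included
```

Checking the length of a slice like this is a cheap way to catch off-by-one mistakes when switching between the two indexers.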
The result of slicing is a new dataframe and can be used in further operations. All the statistical operators that work on entire dataframes work the same way on slices. For example, we can calculate the maximum of a slice:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-61", "source": [ "print(data.loc['FBgn0003360':'FBgn0029167', 'Base mean'].max())" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-62", "source": "You can use conditionals to select data. A comparison is applied element by element and returns a similarly-shaped dataframe of True
and False
. The latter can be used as a mask to subset the original dataframe. The following example creates a new dataframe consisting only of the columns ‘P-adj’ and ‘Gene name’, then keeps the rows that satisfy the expression 'P-adj' < 0.000005.
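A minimal sketch of this selection on made-up values (only the column names and the threshold come from the text; everything else is an assumption for illustration):

```python
import pandas as pd

# Assumption: made-up values; only the column names come from the tutorial.
toy = pd.DataFrame({'P-adj': [1e-9, 0.03, 2e-7],
                    'Gene name': ['Kal1', 'sesB', 'Ant2'],
                    'Strand': ['+', '-', '-']})

subset = toy[['P-adj', 'Gene name']]  # keep two columns
mask = subset['P-adj'] < 0.000005     # one True/False value per row
significant = subset[mask]            # keep rows where the mask is True

print(mask.tolist())
print(significant)
```

Because the comparison is applied to a single numerical column, the mask has one entry per row and can be used directly to filter rows.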
If we had not specified the column to which the expression should be applied, it would have been applied to the entire dataframe. Because the dataframe contains different types of data, an error would occur.
\nConsider the following example of a dataframe consisting only of numerical data. Here the expression is evaluated on every value, and applying the resulting mask returns NaN
for the values that don’t satisfy the expression.
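A sketch of such a case follows. The dataframe numbers and its values are made up, and the name new_data is used for the masked result only as an assumption about how the original cell might have named it.

```python
import pandas as pd

# Assumption: a small made-up numerical frame; new_data is a hypothetical name.
numbers = pd.DataFrame({'a': [1.0, 5.0, 9.0],
                        'b': [8.0, 2.0, 6.0]})

mask = numbers > 4        # same shape as numbers, filled with True/False
new_data = numbers[mask]  # values failing the test become NaN

print(new_data)
print(new_data.max())     # NaN entries are skipped when computing the maximum
```

The masked dataframe keeps its original shape, which is why the values that fail the test have to be represented as NaN rather than dropped.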
This is very useful because NaNs are ignored by operations like max, min, average, etc.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-69", "source": [ "print(new_data.describe())" ], "cell_type": "code", "execution_count": null, "outputs": [ ], "metadata": { "attributes": { "classes": [ "python" ], "id": "" } } }, { "id": "cell-70", "source": "\n\nQuestion: Manipulating dataframes\nExplain what each line in the following short program does: what is in first, second, etc.?
\n\nfirst = pd.read_csv(\"https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular\", sep = \"\\t\", index_col = 'GeneID')\nsecond = first[first['log2(FC)'] > 0 ]\nthird = second.drop('FBgn0025111')\nfourth = third.drop('StdErr', axis = 1)\nfourth.to_csv('result.csv')\n
\n👁 View solution
\n\nLet’s go through this piece of code line by line.
\n\nfirst = pd.read_csv(\"https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular\", sep = \"\\t\", index_col = 'GeneID')\n
This line loads the data into a dataframe called first. The
\nindex_col='GeneID'
parameter selects which column to use as the row labels in the dataframe.\nsecond = first[first['log2(FC)'] > 0 ]\n
This line makes a selection: only those rows of first for which the ‘log2(FC)’ column contains a positive value are extracted. Notice how the Boolean expression inside the brackets is used to select only those rows where the expression is true.
\n\nthird = second.drop('FBgn0025111')\n
As the syntax suggests, this line drops the row from second where the label is ‘FBgn0025111’. The resulting dataframe third has one row fewer than second.
\n\nfourth = third.drop('StdErr', axis = 1)\n
Again we apply the drop function, but in this case we are dropping not a row but a whole column. To accomplish this, we need to specify also the axis parameter.
\n\nfourth.to_csv('result.csv')\n
The final step is to write the data that we have been working on to a csv file. Pandas makes this easy with the
\nto_csv()
function. The only required argument to the function is the filename. Note that the file will be written in the directory from which you started the Jupyter or Python session.
\n\n\nQuestion: Finding min-max indexes\nExplain in simple terms what
\nidxmin
and idxmax
do in the short program below. When would you use these methods?\ndata = pd.read_csv(\"https://zenodo.org/record/3477564/files/annotatedDEgenes.tabular\", sep = \"\\t\", index_col = 'GeneID')\n\nprint(data['Base mean'].idxmin())\nprint(data['Base mean'].idxmax())\n
👁 View solution
\n\n\n
idxmin
will return the index value corresponding to the minimum; idxmax will do the same for the maximum value. You can use these methods whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.
\nOutput:\nFBgn0063667\nFBgn0026562
\n
\n\n\nQuestion: Selecting with conditionals\nAssume Pandas has been imported and the previous dataset has been loaded. Write an expression to select each of the following:\na. P-value of each gene\nb. all the information of gene
\nFBgn0039178
\nc. the information of all genes that belong to chromosome chr3R
👁 View solution
\n\na.
\ndata['P-value']
\nb. data.loc['FBgn0039178', :]
\nc. data[data['Chromosome'] == 'chr3R']
Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results.
\nPandas makes this very easy through the use of the groupby()
method, which splits the data into groups. When the data is grouped in this way, the aggregate method agg()
can be used to apply an aggregating or summary function to each group.
There are a couple of things that should be noted. The agg()
method accepts a dictionary as input that specifies the function to be applied to each column. The output is a new dataframe in which each row corresponds to one group. The output dataframe uses the grouping column as its index; we can turn the index back into a regular column with the reset_index()
method.
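The points above can be sketched on a small made-up table. The rows are assumptions for illustration; the column names follow the tutorial data.

```python
import pandas as pd

# Assumption: made-up rows; column names follow the tutorial's table.
toy = pd.DataFrame({'Chromosome': ['chr3R', 'chrX', 'chr3R', 'chrX'],
                    'Base mean': [10.0, 20.0, 30.0, 40.0],
                    'Start': [100, 200, 300, 400]})

# The dictionary maps each column to the function applied within each group.
summary = toy.groupby('Chromosome').agg({'Base mean': 'mean', 'Start': 'min'})
print(summary)

# reset_index() turns the grouping column back into an ordinary column.
print(summary.reset_index())
```

Here the data are split by chromosome, a different summary function is applied to each column, and the per-group results are combined into one dataframe.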
\n\nQuestion: Finding the max of each group\nUsing the same dataset, try to find the longest genes in each chromosome.
\n\n👁 View solution
\n\n\ndata['Gene Length'] = data['End'] - data['Start']\ndata.groupby('Chromosome').agg(max_length = ('Gene Length', 'max'))\n
\n\n\nQuestion: Grouping with multiple variables\nUsing the same dataset, try to find how many genes are found on each strand of each chromosome.
\n👁 View solution
\n\nYou can group the data according to more than one column.
\n\ndata.groupby(['Chromosome', 'Strand']).size()\n
This tutorial aims to serve as an introduction to data analysis using the Python programming language. We hope you feel more confident in Python!
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- Python has many libraries offering a variety of capabilities, which makes it popular for beginners, as well as, more experienced users\n", "- You can use scientific libraries like Numpy and Pandas to perform data analysis.\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-advanced-np-pd/tutorial.html#feedback) and check there for further resources!\n" ] } ] }