Pan Deng: Integrating pandas.Panel and xarray Features #127

Merged 3 commits, Mar 25, 2016
130 changes: 130 additions & 0 deletions 2016/proposals/deng-pan.md
@@ -0,0 +1,130 @@
# Integrating `pandas.Panel` and xarray Features

## Abstract

The pandas package excels at processing tabular data, especially 1-D and 2-D data held in `pd.Series` and `pd.DataFrame`. However, although it provides `pd.Panel` as a specialized data structure for 3-D and N-D data, pandas is not well designed for high-dimensional data processing. xarray was developed to compensate for this weakness. It enables statistical and mathematical analysis of high-dimensional data and implements several features that are missing in pandas.

However, some useful pandas features are not yet implemented in xarray. My project will therefore focus on porting features from pandas to xarray and on improving data structure conversion between the two. By the end of my work, I expect pandas/xarray users to have specialized tools for different types of data processing and to be able to migrate between pandas and xarray smoothly.

## Technical Details

Most of my proposal is meant to be carried out with features currently implemented in pandas and xarray. For the PCA part, I might switch to C++ and the Eigen3 library to improve performance.

Review comment (Contributor):
Currently xarray is pure Python, so this would be a very large change. Using Cython and/or Numba might be acceptable. I suspect that xarray will eventually have to go down this route. @shoyer can shed some more light.

Review comment:
xarray is currently pure Python built on top of NumPy and dask.array. Indeed, Cython or Numba might be acceptable, but Numba at least would need to be an optional dependency.

For PCA in particular, it would make sense to wrap existing implementations/wrappers in SciPy rather than rolling your own.

Review comment (Contributor Author):
I get it. I have no experience with Cython or SciPy; I will check them out.



## Schedule of Deliverables

### Milestone 1: Migrate features from pandas to xarray

**Week 1 - 5, May 23rd - June 26th**

- I will have my coursework finished within the first two weeks
- I will implement more MultiIndex-related features and port features from pandas to xarray.

- **Week 1:** Enable multiple data types in `xr.DataArray`, like `pd.Panel`

Review comment:
xarray builds its core data structures on NumPy arrays, so this isn't feasible. It's also an intentional design decision not to build multiple dtypes into the array data structure -- we have xr.Dataset for multiple dtypes.

- **Week 2:** Enable groupby for one-dimensional variables / Make levels accessible as coordinate variables

Review comment:
1d groupby in xarray already works -- what are you thinking of here?

Review comment (Contributor Author):
My bad. And thanks I'll include pydata/xarray#659.

- **Week 3:** Based on [#702](https://github.com/pydata/xarray/pull/702) and according to [#719](https://github.com/pydata/xarray/issues/719), I will implement selection that returns objects with a MultiIndex, and add `set_index`/`reset_index`/`swaplevel` to make it easier to create and manipulate multi-indexes.
- **Week 4 - 5:** Port features from `pd.Panel` to `xr.DataArray`, such as `prod`, `cumsum`, and `rank`. Ref. [#791](https://github.com/pydata/xarray/issues/791)
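
For reference, the pandas behaviour that Weeks 3 to 5 aim to mirror in xarray can be sketched as follows. This is a minimal illustration that uses only existing pandas calls (`set_index`, `swaplevel`, `reset_index`, `cumsum`, `prod`); the column names and values are made up for the example, and nothing here reflects how xarray would implement these methods.

```python
import pandas as pd

# Toy data: the column names and values are illustrative assumptions.
df = pd.DataFrame(
    {
        "site": ["a", "a", "b", "b"],
        "year": [2015, 2016, 2015, 2016],
        "count": [1, 2, 3, 4],
    }
)

# set_index builds a two-level MultiIndex from existing columns.
indexed = df.set_index(["site", "year"])

# swaplevel exchanges the order of the two index levels.
swapped = indexed.swaplevel("site", "year")

# reset_index moves the index levels back into ordinary columns.
restored = indexed.reset_index()

# A cumulative operation and a reduction named in the Week 4 - 5 item.
running_total = indexed["count"].cumsum()
product = indexed["count"].prod()
```

The round trip through `set_index` and `reset_index` is the part most relevant to conversion work, since it shows that index levels and columns carry the same information in two layouts.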

### Milestone 2: Type conversion between pandas and xarray data structure

**Week 6 - 9, June 27th - July 24th**

- To ensure that data can move smoothly between pandas and xarray data structures, I want to spend time optimizing the conversion between data types. I will review the current conversion methods and implement more of them: from a multi-indexed (hierarchically indexed) `pd.DataFrame` to `xr.Dataset`, and from `xr.Dataset` to a 2-D `pd.DataFrame` with PCA.

- **Week 6:** Test current conversion methods with the features newly added in Milestone 1 and fix existing bugs.
- **Week 7:** Based on [#702](https://github.com/pydata/xarray/pull/702) and Milestone 1, I am going to take one step further and implement conversion from a MultiIndexed `pd.DataFrame` to `xr.Dataset`.

Review comment:
This actually already works in the latest release of xarray -- is there an outstanding issue about this you found?

Review comment (Contributor Author):
Nope sorry I missed it. The latest update about multiIndex I saw is in pydata/xarray#702.
I thought that although the method is ready, we still need to include it in several upper-level(?) features. I will take a closer look.

- **Week 8 - 9:** It might be useful to compress multi-dimensional metadata into a 2-dimensional or 3-dimensional dataset and view it in matplotlib. I am going to implement a PCA method in xarray and enable conversion from `xr.Dataset` to `pd.DataFrame` via PCA for certain data types.

Review comment:
PCA could make sense, but it should be wrapping the existing implementations of the numerics in scipy, scikit-learn and/or dask.array.
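
The Week 8 - 9 idea, combined with the reviewer's advice to wrap existing numerics, might look roughly like the sketch below. It uses `numpy.linalg.svd` as a stand-in for the SciPy, scikit-learn, or dask.array routines mentioned in the comment, and it takes a plain NumPy array rather than an `xr.Dataset`. The function name `pca_to_frame`, its parameters, and the `PC1`, `PC2` column labels are hypothetical.

```python
import numpy as np
import pandas as pd


def pca_to_frame(data, n_components=2):
    """Project N-D data onto its leading principal components.

    Hypothetical helper: the first axis of `data` is treated as samples
    and all remaining axes are flattened into features.
    """
    data = np.asarray(data, dtype=float)
    samples = data.reshape(data.shape[0], -1)

    # Center each feature so the SVD describes variance around the mean.
    centered = samples - samples.mean(axis=0)

    # Thin SVD: centered = u @ diag(s) @ vt; rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)

    # Scores are the coordinates of each sample along the leading axes.
    scores = u[:, :n_components] * s[:n_components]
    columns = [f"PC{i + 1}" for i in range(scores.shape[1])]
    return pd.DataFrame(scores, columns=columns)


# Usage: 6 samples of a 3 x 4 grid reduced to two columns for plotting.
rng = np.random.default_rng(0)
cube = rng.normal(size=(6, 3, 4))
frame = pca_to_frame(cube, n_components=2)
```

Delegating the decomposition to an existing SVD routine keeps the wrapper small; the only project-specific parts are the flattening rule and the labelling of the output columns.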


### Milestone 3: Port xarray features to pandas

**Week 10 - 11, July 25th - Aug 7th**

- `pd.DataFrame` currently supports basic `.loc` indexing. I am going to bring xarray's `.loc` behavior back into `pd.DataFrame` to enable dictionary syntax for indexing.

Review comment (Contributor):
well if you mean adding support for dictionary indexers, then ok.

- I will also check for other useful features in xarray and migrate them back to pandas.
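
As one possible reading of "dictionary indexers", the sketch below selects rows of a DataFrame by index level name, loosely analogous to xarray's `da.loc[dict(dim=label)]` syntax. `select_by_dict` is a hypothetical helper, not an existing or planned pandas API; it is built only on `Index.get_level_values` and `Index.isin`, and the level names and values are made up.

```python
import numpy as np
import pandas as pd


def select_by_dict(df, indexers):
    """Select rows whose index levels match the labels in `indexers`.

    Hypothetical helper: `indexers` maps an index level name to a single
    label or to a list of labels. Levels not mentioned are left unfiltered.
    """
    mask = np.ones(len(df), dtype=bool)
    for level, labels in indexers.items():
        if not isinstance(labels, (list, tuple, set)):
            labels = [labels]
        # isin returns a boolean array aligned with the rows of df.
        mask &= df.index.get_level_values(level).isin(list(labels))
    return df.loc[mask]


# Usage with a two-level MultiIndex.
index = pd.MultiIndex.from_product(
    [["a", "b"], [2015, 2016]], names=["site", "year"]
)
df = pd.DataFrame({"count": [1, 2, 3, 4]}, index=index)
subset = select_by_dict(df, {"site": "b", "year": [2015, 2016]})
latest = select_by_dict(df, {"year": 2016})
```

A boolean mask is used here instead of tuple-based `.loc` slicing so that the helper does not depend on the index being sorted and always returns a DataFrame.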

### Milestone 4: Clean-up and Wrap-up

**Week 12 - 13, Aug. 8th - Aug. 23rd**

- Check for and fix open issues in xarray.
- Clean up the code and finish the documentation.

## Future work
I would like to continue working on the pandas and xarray projects to make them better libraries.

## Open Source Development Experience
I have been using several open-source packages for my lab projects, but I have only just started contributing to open-source projects. I have fixed several bugs in the pandas package and contributed patches to Shogun, a C++-based machine learning toolkit.

I have gained a lot from my limited experience. First, trying to solve problems pushed me out of my comfort zone and drove me to learn new skills. Reading other developers' code and discussing it with them also greatly improved my coding insight. Meanwhile, I started to learn the conventions for developing and maintaining large projects. Open-source development has benefited me a lot and will continue to do so.

## Patch samples
- **Pandas:**
  - [#12614](https://github.com/pydata/pandas/pull/12614) Fixed bug: `pd.crosstab` `margins=True` ignoring dropna
  - [#12650](https://github.com/pydata/pandas/pull/12650) Fixed bug: `pd.pivot_table` `dropna=False` drops columns/index names

- **Shogun:** (a C++-based machine learning toolkit)
  - [#3092](https://github.com/shogun-toolbox/shogun/pull/3092) Project-wise FLAG removal
  - [#3096](https://github.com/shogun-toolbox/shogun/pull/3096) API: add mean computation to linear algebra library

## Informatics skills

**Python/NumPy/pandas**
- COURSES:
  - Fundamentals of Computing Specialization (Coursera, License: 5ePAwImNEe)
  - Bioinformatics Algorithms (Coursera)
- PROJECTS:
  - Genome-wide Mitochondrial Targeting Prediction in C. elegans
  - RNA-seq Analysis of MSAF-1 Mutant Gene Expression Profile in C. elegans

**Linear Algebra/Statistics**
- COURSES:
  - Linear Algebra and Analytical Geometry (Undergraduate, grade 95/100)
  - Probability and Statistics (Undergraduate, grade 99/100)
  - Stochastic Mathematical Methods (Undergraduate, grade 98/100)
  - Quantitative Genomics and Genetics (Graduate, Honor)
- PROJECTS:
  - S phase genes identification and pathway enrichment study via microarray and singular value decomposition (SVD)
  - Genome-wide Association Study (GWAS) on Causal Polymorphisms Regulating Gene CG9186 Expression at Pupation Stage in Drosophila melanogaster

**Other skills: R, Matlab, Java, C++, bash**

## Academic Experience

Sept. 2013 - present
- **Memorial Sloan Kettering Cancer Center**
- Graduate Research Assistant,
- Cell Department, Laboratory of Dr. Cole Haynes

Sept. 2012 - present
- **Weill Cornell Graduate School of Medical Sciences**
- Graduate student,
- Biochemistry, Cell and Molecular Biology Allied Program

Aug. 2008 - Jul. 2012
- **Tsinghua University, China**
- Undergraduate Student, School of Life Sciences, Major in Biological Sciences
- Graduated with Honors
- Distinguished Dissertation

## Why this project?

I have used the pandas package in my lab project on deep sequencing data analysis. I have also started to contribute to pandas by fixing two bugs: `pd.crosstab` with `margins=True` ignoring `dropna`, and `pd.pivot_table` with `dropna=False` dropping column/index names.

I am interested in becoming a data scientist, a career in which Python and pandas are powerful and popular tools. I would like to join the development of pandas because it helps me understand pandas better and improves my coding skills and insight.

I look forward to joining the Panel/xarray project because it offers me the chance to review and study the general features, advantages, and disadvantages of both the pandas and xarray packages. I believe that I can benefit a lot from completing this work, and so can pandas and xarray.

## Contact information
Name: Pan Deng

E-mail: [email protected]

Github: OXPHOS

IRC: OXPHOS

Time zone: UTC-05:00