Pan Deng: Integrating pandas.Panel and xarray Features #127
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
# Integrating `pandas.Panel` and xarray Features | ||
|
||
## Abstract | ||
|
||
The pandas package excels at processing tabular data, especially 1-D and 2-D data held in the `pd.Series` and `pd.DataFrame` data structures. However, despite providing `pd.Panel`, a specialized data structure for 3-D and N-D data, pandas is not well designed for high-dimensional data processing. Xarray was developed to compensate for this weakness. It enables statistical and mathematical analysis of high-dimensional data and implements several features that are missing in pandas. | ||
|
||
However, some useful pandas features are not yet implemented in xarray. My project will therefore focus on porting features from pandas to xarray and on improving data structure conversion between the two. By the end of my work, I expect pandas/xarray users to have specialized tools for different types of data processing and to be able to migrate between pandas and xarray smoothly. | ||
|
||
## Technical Details | ||
|
||
Most of my proposal can be carried out with features currently implemented in pandas and xarray. For the PCA part, I might switch to C++ and the Eigen3 library to improve performance. | ||
|
||
|
||
## Schedule of Deliverables | ||
|
||
### Milestone 1: Migrate features from pandas to xarray | ||
|
||
**Week 1 - 5, May 23rd - June 26th** | ||
|
||
- I will finish my coursework within the first two weeks. | ||
- I will implement more MultiIndex-related features and port features from pandas to xarray. | ||
|
||
- **Week 1:** Enable multiple dtypes in `xr.DataArray`, as in `pd.Panel` | ||
Review comment: xarray builds its core data structures on NumPy arrays, so this isn't feasible. It's also an intentional design decision not to build multiple dtypes into the array data structure -- we have |
||
- **Week 2:** Enable groupby for one-dimensional variables / Make levels accessible as coordinate variables | ||
Review comment: 1d groupby in xarray already works -- what are you thinking of here? Reply: My bad, and thanks. I'll include pydata/xarray#659. |
||
- **Week 3:** Based on [#702](https://github.com/pydata/xarray/pull/702) and according to [#719](https://github.com/pydata/xarray/issues/719), I will implement selection that returns objects with a MultiIndex, and add `set_index`/`reset_index`/`swaplevel` to make it easier to create and manipulate multi-indexes. | ||
- **Week 4 - 5:** Port features from `pd.Panel` to `xr.DataArray`, such as `prod`, `cumsum`, and `rank`. Ref. [#791](https://github.com/pydata/xarray/issues/791) | ||
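
The multi-index helpers named for Week 3 already exist on the pandas side. As a minimal sketch of the behavior to be mirrored in xarray (the column names and values here are made up for illustration), they work like this in pandas:

```python
import pandas as pd

# Illustrative data; the column names are assumptions, not from the proposal.
df = pd.DataFrame(
    {
        "site": ["x", "x", "y", "y"],
        "year": [2015, 2016, 2015, 2016],
        "temp": [10.5, 11.0, 8.2, 8.9],
    }
)

# set_index: promote two columns to a two-level MultiIndex.
indexed = df.set_index(["site", "year"])

# swaplevel: exchange the order of the two index levels, then sort.
swapped = indexed.swaplevel("site", "year").sort_index()

# reset_index: turn the index levels back into ordinary columns.
flat = swapped.reset_index()

print(swapped.loc[(2016, "y"), "temp"])
print(list(flat.columns))
```

An xarray version would presumably apply the same operations to coordinates rather than to row labels, but its exact API is what the proposal leaves to be designed.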
|
||
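
For reference, the reductions listed for Week 4 - 5 behave as follows on a small `pd.DataFrame`. This only sketches the pandas semantics that a port would need to reproduce; the data is invented for illustration.

```python
import pandas as pd

# Small illustrative frame; the values are arbitrary.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 1.0, 2.0]})

# prod: product of each column, returned as a Series.
products = df.prod()

# cumsum: running total down each column, same shape as the input.
running = df.cumsum()

# rank: rank of each value within its column (1 = smallest).
ranks = df.rank()

print(products.to_dict())
print(running["b"].tolist())
print(ranks["b"].tolist())
```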
### Milestone 2: Type conversion between pandas and xarray data structure | ||
|
||
**Week 6 - 9, June 27th - July 24th** | ||
|
||
- To ensure that data can move smoothly between pandas and xarray data structures, I want to spend time optimizing the conversion between data types. I will review the current conversion methods and implement more features: from a multi-indexed (hierarchically indexed) `pd.DataFrame` to `xr.Dataset`, and from `xr.Dataset` to a 2-D `pd.DataFrame` with PCA. | ||
|
||
- **Week 6:** Test the current conversion methods with the features newly added in Milestone 1 and fix existing bugs. | ||
- **Week 7:** Based on [#702](https://github.com/pydata/xarray/pull/702) and Milestone 1, I am going to go one step further and implement conversion from a MultiIndexed `pd.DataFrame` to `xr.Dataset`. | ||
Review comment: This actually already works in the latest release of xarray -- is there an outstanding issue about this you found? Reply: Nope, sorry, I missed it. The latest update about MultiIndex I saw is in pydata/xarray#702. |
||
- **Week 8 - 9:** It might be useful to compress multi-dimensional metadata into a 2-dimensional or 3-dimensional dataset and view it in matplotlib. I am going to implement a PCA method in xarray and enable conversion from `xr.Dataset` to `pd.DataFrame` via PCA for certain data types. | ||
Review comment: PCA could make sense, but it should wrap the existing implementations of the numerics in SciPy, scikit-learn and/or dask.array. |
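
As the review comment on Week 7 notes, conversion from a MultiIndexed `pd.DataFrame` already works in xarray. A minimal sketch of the round trip, assuming xarray is installed and using made-up level names (`sample`, `time`) and values:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative two-level index; the level names are assumptions.
index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1, 2]], names=["sample", "time"]
)
df = pd.DataFrame({"value": np.arange(6.0)}, index=index)

# Each index level becomes a dimension of the resulting Dataset.
ds = xr.Dataset.from_dataframe(df)

# Label-based selection on the new dimensions.
picked = float(ds["value"].sel(sample="b", time=1))

# Converting back yields a DataFrame indexed by the same two levels.
roundtrip = ds.to_dataframe()

print(dict(ds.sizes))
print(picked)
```

This is the kind of check Week 6 describes: run the existing conversions against representative inputs before adding anything new.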
||
|
||
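
The PCA step in Week 8 - 9 could look roughly like the sketch below. It is not an xarray API. `pca_to_frame` is a hypothetical helper, and NumPy's SVD stands in for the SciPy or scikit-learn implementation that the reviewer recommends wrapping. The input is a plain N-D array whose first axis indexes the observations.

```python
import numpy as np
import pandas as pd


def pca_to_frame(data, n_components=2):
    """Hypothetical helper: project N-D data onto its leading principal
    components and return the scores as a 2-D DataFrame."""
    matrix = np.asarray(data, dtype=float)
    # Flatten every axis except the first into one feature axis.
    matrix = matrix.reshape(matrix.shape[0], -1)
    # PCA operates on mean-centered features.
    centered = matrix - matrix.mean(axis=0)
    # SVD of the centered matrix; singular values come back in descending order.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    columns = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, columns=columns)


# Random illustrative data: 10 observations of a 4 x 3 grid.
rng = np.random.default_rng(0)
cube = rng.normal(size=(10, 4, 3))
frame = pca_to_frame(cube)
print(frame.shape)
```

The resulting frame has one row per observation and can be passed directly to matplotlib for a 2-D scatter plot.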
### Milestone 3: Port xarray features to pandas | ||
|
||
**Week 10 - 11, July 25th - Aug 7th** | ||
|
||
- `pd.DataFrame` currently supports basic `.loc` indexing. I am going to bring xarray-style `.loc` to `pd.DataFrame` to enable dictionary syntax for indexing. | ||
Review comment: Well, if you mean adding support for dictionary indexers, then OK. |
||
- I will also look for other useful features in xarray and migrate them to pandas. | ||
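
One way to picture dictionary indexers on a `pd.DataFrame` is a thin wrapper over the existing `.loc`. The sketch below is only an assumption about what such support might look like. `loc_by_dict` and its `"index"`/`"columns"` keys are hypothetical names, not part of pandas.

```python
import pandas as pd


def loc_by_dict(df, indexers):
    """Hypothetical helper: select rows/columns by axis name using a dict."""
    unknown = set(indexers) - {"index", "columns"}
    if unknown:
        raise KeyError(f"unknown axis names: {sorted(unknown)}")
    # An axis missing from the dict selects everything along that axis.
    rows = indexers.get("index", slice(None))
    cols = indexers.get("columns", slice(None))
    return df.loc[rows, cols]


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# Select rows "x" and "z" of column "b" by naming the axes.
subset = loc_by_dict(df, {"index": ["x", "z"], "columns": ["b"]})
print(subset)
```

Naming the axes lets a caller omit any axis it does not care about, which is the convenience xarray's dictionary indexing offers for named dimensions.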
|
||
### Milestone 4: Clean-up and Wrap-up | ||
|
||
**Week 12 - 13, Aug. 8th - Aug. 23rd** | ||
|
||
- Check for and fix open issues in xarray. | ||
- Clean up the code and finish the documentation. | ||
|
||
## Future work | ||
I would like to continue working on the pandas and xarray projects to help make both libraries better. | ||
|
||
## Open Source Development Experience | ||
I have been using several open-source packages for my lab projects, but I have only recently started contributing to open-source projects. I fixed several bugs in the pandas package for Python and contributed patches to Shogun, a C++-based machine learning toolkit. | ||
|
||
I have gained a lot from my limited experience. First, trying to solve problems pushed me out of my comfort zone and drove me to learn new skills. Reading other developers' code and discussing it with them also greatly improved my coding insight. Meanwhile, I started to learn the rules for developing and maintaining large projects. Open-source development has therefore benefited me a lot and will continue to do so. | ||
|
||
## Patch samples | ||
- **Pandas:** | ||
- [#12614](https://github.com/pydata/pandas/pull/12614) Fixed bug: `pd.crosstab` `margins=True` ignoring dropna | ||
- [#12650](https://github.com/pydata/pandas/pull/12650) Fixed bug: `pd.pivot_table` `dropna=False` drops columns/index names | ||
|
||
- **Shogun:** (a C++-based machine learning toolkit) | ||
- [#3092](https://github.com/shogun-toolbox/shogun/pull/3092) Project-wise FLAG removal | ||
- [#3096](https://github.com/shogun-toolbox/shogun/pull/3096) API: add mean computation to linear algebra library | ||
|
||
## Informatics skills | ||
|
||
**Python/Numpy/Pandas** | ||
- COURSES: | ||
- Fundamentals of Computing Specialization (Coursera, License: 5ePAwImNEe) | ||
- Bioinformatics Algorithms (Coursera) | ||
- PROJECTS: | ||
- Genome-wide Mitochondrial Targeting Prediction in C. elegans | ||
- RNA-seq Analysis of MSAF-1 Mutant Gene Expression Profile in C. elegans | ||
|
||
**Linear Algebra/Statistics** | ||
- COURSES: | ||
- Linear Algebra and Analytical Geometry (Undergraduate, grade 95/100) | ||
- Probability and Statistics (Undergraduate, grade 99/100) | ||
- Stochastic Mathematical Methods (Undergraduate, grade 98/100) | ||
- Quantitative Genomics and Genetics (Graduate, Honor) | ||
- PROJECTS: | ||
- S phase genes identification and pathway enrichment study via microarray and singular value decomposition (SVD) | ||
- Genome-wide Association Study (GWAS) on Causal Polymorphisms Regulating Gene CG9186 Expression at Pupation Stage in Drosophila melanogaster | ||
|
||
**Other skills: R, Matlab, Java, C++, bash** | ||
|
||
## Academic Experience | ||
|
||
Sept. 2013 - present | ||
- **Memorial Sloan Kettering Cancer Center** | ||
- Graduate Research Assistant, | ||
- Cell Department, Laboratory of Dr. Cole Haynes | ||
|
||
Sept. 2012 - present | ||
- **Weill Cornell Graduate School of Medical Sciences** | ||
- Graduate student, | ||
- Biochemistry, Cell and Molecular Biology Allied Program | ||
|
||
Aug. 2008 - Jul. 2012 | ||
- **Tsinghua University, China** | ||
- Undergraduate Student, School of Life Sciences, Major in Biological Sciences | ||
- Graduated with Honor | ||
- Distinguished Dissertation | ||
|
||
## Why this project? | ||
|
||
I have used the pandas package before in my lab project on deep sequencing data analysis. I also started contributing to pandas by fixing bugs: `pd.crosstab` with `margins=True` ignoring `dropna`, and `pd.pivot_table` with `dropna=False` dropping columns/index names. | ||
|
||
I am interested in becoming a data scientist, a career in which Python and pandas are powerful and popular tools. I would like to join the development of pandas because it will help me understand pandas better and improve my coding skills and insight. | ||
|
||
I look forward to joining the Panel/xarray project because it offers me the chance to review and study the general features, advantages, and disadvantages of both pandas and xarray. I believe that both I and pandas/xarray will benefit a lot from the finished work. | ||
|
||
## Contact information | ||
Name: Pan Deng | ||
|
||
E-mail: [email protected] | ||
|
||
Github: OXPHOS | ||
|
||
IRC: OXPHOS | ||
|
||
Time zone: UTC-05:00 | ||
|
||
|
||
|
||
|
Review comment: Currently `xarray` is pure Python, so this would be a very large change. Using `cython` and/or `numba` might be acceptable. I suspect that `xarray` will eventually have to go down this route. @shoyer can shed some more light.

xarray is currently pure Python built on top of NumPy and dask.array. Indeed, Cython or Numba might be acceptable, but Numba at least would need to be an optional dependency.
For PCA in particular, it would make sense to wrap existing implementations/wrappers in SciPy rather than rolling your own.

Reply: I get it. I have no experience with Cython or SciPy, so I will check them out.