can you give me a spark data process demo? #1

Open
aixuedegege opened this issue Jan 28, 2019 · 1 comment

Comments

@aixuedegege

I want to use Spark to build a data processing pipeline that starts by downloading pictures and then scales and crops them, and so on. How can I use your framework for this? Could you give me a demo? Thanks a lot!

@moncho-mendez
Collaborator

moncho-mendez commented Jan 30, 2019

Hello

Sorry for the delay, but we had to write some code to address your request.

We have not integrated this product with Spark and, unfortunately, we do not have a demo for it. We have put together an example as quickly as we could, and it is attached. This project is a simple pipeline implementation derived from the Mallet pipeline (some source code was taken from there) with some interesting features, some of which are still to come. The interesting features are:

  • Input-output type checking: support is developed and enabled, but I am afraid that the check should be invoked manually (not while building pipes, as implemented now).
  • alwaysBefore and notAfter constraints. The former are pipes that must be executed earlier. For instance, in text mining, if you have a StopWordRemoverPipe (which finds and removes stopwords using a list of uppercase stopwords), this pipe would have an alwaysBefore dependency on TransformTextToUppercase. notAfter dependencies are pipes that cannot be executed after a given pipe. For instance, if you have a pipe that drops HTML tags (DropHTMLTags), this pipe would have a notAfter dependency on another pipe that detects whether the content is HTML (DetectHTML).
  • Loading the pipe infrastructure from XML (under development).
  • Dynamic loading of jars containing pipes (we are currently studying libraries and methods to implement this feature).
  • Different kinds of pipes (which will imply further checks in the future). Pipes can be PropertyComputingPipes (they only compute properties and do not transform the data), TransformDataPipes (they transform data), TargetGuesser (they detect the target attribute for classification or prediction problems) and TeePipes (they save all instances to one or more files and detect the last instance to save, so that pipe programmers can keep files open, e.g. CSV files, and close them using isLast()).
  • Instance invalidation. An instance can be invalidated at any point in the process and will not be processed any further.
  • Some dataset utilities to facilitate integration with Weka (see the Main class in the provided example).
  • Parallel pipe (in progress, but usage would be limited to the API and development is currently stopped).
  • A javax/swing GUI to build a pipe-based task (future).
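
To make the alwaysBefore and notAfter constraints more concrete, here is a minimal sketch of how an ordering check could work. It is not the bdp4j API: PipeSpec and PipeOrderCheck are hypothetical names invented for this illustration, and the sketch assumes that a notAfter entry means "the named pipe must not appear later in the pipeline than the pipe that declares it".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a pipe: a name plus its two constraint lists.
final class PipeSpec {
    final String name;
    final List<String> alwaysBefore; // pipes that must already have run
    final List<String> notAfter;     // pipes that must not run after this one (assumed meaning)

    PipeSpec(String name, List<String> alwaysBefore, List<String> notAfter) {
        this.name = name;
        this.alwaysBefore = alwaysBefore;
        this.notAfter = notAfter;
    }
}

final class PipeOrderCheck {
    // Walks the pipes in execution order and collects every violated constraint.
    static List<String> violations(List<PipeSpec> pipeline) {
        List<String> problems = new ArrayList<>();
        Set<String> seen = new HashSet<>();                  // pipes that have already run
        Map<String, String> forbiddenLater = new HashMap<>(); // pipe name -> pipe that forbids it

        for (PipeSpec pipe : pipeline) {
            for (String required : pipe.alwaysBefore) {
                if (!seen.contains(required)) {
                    problems.add(pipe.name + " requires " + required + " before it");
                }
            }
            String blocker = forbiddenLater.get(pipe.name);
            if (blocker != null) {
                problems.add(pipe.name + " cannot run after " + blocker);
            }
            for (String later : pipe.notAfter) {
                forbiddenLater.put(later, pipe.name);
            }
            seen.add(pipe.name);
        }
        return problems;
    }
}
```

With the pipes named above, the order DetectHTML, DropHTMLTags, TransformTextToUppercase, StopWordRemoverPipe would produce no violations under these assumptions, while placing StopWordRemoverPipe first or DetectHTML after DropHTMLTags would each be reported.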

Although the basic pipe functionality works, we are still developing most of the interesting features. As you can imagine from the description above, there is a lot of work left to do.

I am attaching an example that preprocesses SMS messages extracted from http://www.esp.uem.es/jmgomez/smsspamcorpus/. It is very simple, but it shows several pipes of different kinds working together. We have also added the example to the source repository. The example uses simple data, but you can extract properties in more complex ways.
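
For readers who do not open the attachment, the general idea of different kinds of pipes working together, including instance invalidation, can be sketched roughly as follows. This is neither the attached example nor the bdp4j API: SmsInstance, MiniPipeline and sample() are hypothetical names, and each pipe is modelled as a plain java.util.function.Consumer purely to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical instance: the data, the computed properties and a validity flag.
final class SmsInstance {
    String data;
    final Map<String, Object> properties = new HashMap<>();
    boolean valid = true;

    SmsInstance(String data) {
        this.data = data;
    }
}

final class MiniPipeline {
    private final List<Consumer<SmsInstance>> pipes = new ArrayList<>();

    MiniPipeline add(Consumer<SmsInstance> pipe) {
        pipes.add(pipe);
        return this;
    }

    // Runs every pipe in order; an invalidated instance skips the remaining pipes
    // and is left out of the result.
    List<SmsInstance> run(List<SmsInstance> instances) {
        List<SmsInstance> kept = new ArrayList<>();
        for (SmsInstance inst : instances) {
            for (Consumer<SmsInstance> pipe : pipes) {
                if (!inst.valid) {
                    break;
                }
                pipe.accept(inst);
            }
            if (inst.valid) {
                kept.add(inst);
            }
        }
        return kept;
    }

    // Three illustrative pipes: invalidate blank messages, compute a property,
    // then transform the data.
    static MiniPipeline sample() {
        return new MiniPipeline()
            .add(i -> { if (i.data.trim().isEmpty()) i.valid = false; })
            .add(i -> i.properties.put("length", i.data.length()))
            .add(i -> i.data = i.data.toUpperCase(Locale.ROOT));
    }
}
```

Calling MiniPipeline.sample().run(...) on a list of SmsInstance objects returns only the instances that were still valid after the last pipe, with their "length" property set and their text uppercased.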

We hope you find our project useful. We are working hard to complete more functionality, but our team is small (one undergraduate student, one Ph.D. student and a teacher with lots of things to do), so we work slowly. Still, we hope to have the entire functionality working by summer (August).

Thanks for your interest! The example is below.

bdp4j_sample.zip

P.S. If you end up using bdp4j, we would appreciate it if you let us know about your project. Of course, we can answer your questions and help you get everything working.

With best regards.
bdp4j Team (Yeray, María & Moncho)
