GitHub - nizarmassouh/WebDB: For the creation of image databases from the web.

A python implementation of the IROS 2017 paper: Learning Deep Visual Object Models From Noisy Web Data: How to Make it Work

Create image databases from the Web. This software will expand a list of visual queries and download images from 3 search engines (google, yahoo and bing).

Requirements

Python 3.8 or above
Selenium 3.141 or above
BeautifulSoup 4.9.3 or above
Firefox
Firefox webdriver

Setting up Firefox Webdriver

There are couple of ways to install Firefox Webdriver.

If you are on debian based distribution sudo apt-get install firefox-geckodriver would install the Firefox webdriver. by default the driver is added to the $PATH.

or

Extract the contents of the downloaded file. For file ending with .zip extension use unzip <filename> or file ending with .tar.gz extension use tar -xvf <filename>
Finally, add the extracted file to $PATH. Refer how to add executables to PATH

Once the driver is installed, check by running geckodriver in the terminal. User should see similar output.

1621246053648 geckodriver INFO Listening on 127.0.0.1:4444

Press ctrl + c to close the driver.

Create Expanded Query list

The query expansion works by finding the top 50 keywords that best match the initial query using http://mykeyworder.com/keywords. The result is initial query append by each keyword found. e.g. If initial query is dog the query expansion output would be dog animal, dog pet. etc.

Run the bash script expand.sh with 1 argument file containing (the list of queries that are separated by line) from terminal

For e.g.

sh expand.sh queries.txt

The expanded query will be saved to the file expanded_queries.txt with entries comma separated.

Download Web Images

User can download the images by running download.py using 3 arguments: example:

python download.py --queries <queries.txt or expanded_queries.txt> --directories dirnames.txt --run_headless

--queries A file containing the search queries or the expanded queries.
--directories: A file containing the directory name where the downloaded images are stored
--run_headless: Argument that doesn't display the browser when script runs. Don't pass this argument when you don't need to visualize the script in action. This is useful for debugging purposes and browser navigation works as expected.

where queries.txt is a text file containing list of queries and dirnames.txt is the equivalent directory name of each query line by line.

	queries.txt:	query1
			query2
			...
	dirnames.txt:	dir1
     		 	dir2
			...

Note: It's also possible to provide single entry in the dirnames.txt file, so that all the images are saved in a single folder.

In case of query expansion the queries provided on the same line separated with a comma can be downloaded into the same directory name.

queries.txt: query1.1, query1.2, query1.3...     dirnames.txt: dir1
    	     query2.1, query2.2, query2.3...                   dir2
	     ...                                               ...

How to debug

All the search engines constantly try to introduce changes to their webpages in order to restrict webscraping, therefore scripts in this repository requires maintenance when they break.

We provide very basic methods for debugging the scripts.

We try to catch and log the exceptions in the console during HTML parsing or image retrieval. Analyzing the stack trace could help identify the root cause.
Visualizing the selenium action on the browser could help quickly identify any browser rendering issue. Remove --run_headless flag argument when invoking download.py.
We use css selectors in order get uniquely identify selector tags in the webpage. Search engines could update these tags which might cause the script to fail.
Ensure you are using the latest version of Firefox browser and update the User-Agent accordingly. Depending upon browser version and User-Agent the page might render differently which could cause the script to fail.
If you observe slower network speed. Try increasing the script sleep time.

The script is expected to run slower with varying sleep times. This is mimic human behavior and not to flood server with constant requests. Warning: Repeated requests with almost no wait time could lead to blacklisting of the user's IP.

How to use css selector

How to check tags are uniquely identified.
- Open Firefox, Navigate to "www.google.com"
- Press F12 to open the developer tools
- Press Ctrl + Shift + c to pick elements from the page
- Hover the mouse over Input Search text box in google search page.
- In the inspector tab, you should see the <input class="gLFyf gsfi"..etc..> selected
- Now click Console tab. Type $$("input.gLFyf.gsfi") and hit enter. Spaces in the class name should be replace with .
- You should see the output html tag displayed

Using the above tag in selenium is very easy

 input_text_box = driver.find_element_by_css_selector("input.gLFyf.gsfi")

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
Download		Download
Duplicates Removal		Duplicates Removal
Isolated Images Filter		Isolated Images Filter
Query Expansion		Query Expansion
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
dirnames.txt		dirnames.txt
download.py		download.py
expand.sh		expand.sh
queries.txt		queries.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A python implementation of the IROS 2017 paper: Learning Deep Visual Object Models From Noisy Web Data: How to Make it Work

Requirements

Setting up Firefox Webdriver

Create Expanded Query list

Download Web Images

How to debug

How to use css selector

About

Releases

Packages

Languages

License

nizarmassouh/WebDB

Folders and files

Latest commit

History

Repository files navigation

A python implementation of the IROS 2017 paper: Learning Deep Visual Object Models From Noisy Web Data: How to Make it Work

Requirements

Setting up Firefox Webdriver

Create Expanded Query list

Download Web Images

How to debug

How to use css selector

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages