taskmadstreetden v1.0.1

License: ISC

Sample scraped files for the men's and women's clothing sections of both sites, Shop Mango and George, are included in the directory.

To get started, install the package:

npm i taskmadstreetden

npm is the Node.js package manager (like pip in Python), and this module is published in the npm registry.

Once installed, you will have all the files in place.

To run the REST API server, all you need to do is type node app.js in the root directory of the project in a terminal. This initializes the Express server.
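For context, here is a minimal sketch of what app.js might set up; the route shape is inferred from the endpoint URLs documented below, and the handler body is a placeholder rather than the actual implementation:

    // app.js - minimal sketch (route shape inferred from the URLs below)
    const express = require('express');
    const app = express();

    app.get('/shopmango/:gender', async (req, res) => {
      const { type, ct, browser } = req.query; // e.g. type=clothing&ct=Jumpsuits,jackets
      // ...the scraping logic described below would run here...
      res.json({ gender: req.params.gender, type, categories: ct ? ct.split(',') : [] });
    });

    app.listen(3472, () => console.log('API listening on http://localhost:3472'));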

Available categories for women (for the ct parameter): Coats, Jackets, Suits, Dresses, Jumpsuits, Cardigans and sweaters, Shirts, T-shirts and tops, Trousers, Jeans, Skirts

Available categories for men: Coats, Jackets, Blazers, Suits, Cardigans and sweaters, Sweatshirts, Shirts, T-shirts, Trousers, Jeans, Underwear

http://localhost:3472/shopmango/women?type=clothing&ct=Jumpsuits,jackets,suits&browser=true

By default a Chromium browser is opened to run the automation; you can turn this off by setting browser=false.

The same applies for men: change women to men in the URL and use the correct (men's) categories.
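For example, a sample call for the men's section (assuming the server is running locally; the category names come from the men's list above):

    curl "http://localhost:3472/shopmango/men?type=clothing&ct=Coats,Jeans&browser=false"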

Now it starts to eavesdrop on the packets sent back and forth and identifies the network XHR request whose URL is

https://shop.mango.com/services/cataloglist/filtersProducts/IN/he/sections_he.prendas_he/?idSubSection=abrigos_he&menu=familia;106&pageNum=1000&rowsPerPage=20&columnsPerRow=4

This one. The IDs and strings in it change here and there, so we eavesdrop until this request is detected.
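A rough sketch of how that eavesdropping could look, assuming Puppeteer is the library driving the Chromium instance (the matching substring comes from the URL above):

    // Sketch: watch network responses until the catalog XHR shows up (assumes Puppeteer)
    const puppeteer = require('puppeteer');

    async function findCatalogUrl(pageUrl) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // Resolves once a response matches the filtersProducts endpoint
      const catalogUrl = new Promise(resolve => {
        page.on('response', response => {
          if (response.url().includes('/services/cataloglist/filtersProducts/')) {
            resolve(response.url());
          }
        });
      });

      await page.goto(pageUrl, { waitUntil: 'networkidle2' });
      const url = await catalogUrl;
      await browser.close();
      return url;
    }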

Once it is, we go into an infinite loop and transform the link into

https://shop.mango.com/services/cataloglist/filtersProducts/IN/he/sections_he.prendas_he/?idSubSection=abrigos_he&menu=familia;106&pageNum=1

which gives the maximum number of items for that page number; we make an awaited AJAX request, get the JSON, and process it.

This process is repeated for page 2 (this is the response that gets processed):

https://shop.mango.com/services/cataloglist/filtersProducts/IN/he/sections_he.prendas_he/?idSubSection=abrigos_he&menu=familia;106&pageNum=2

We keep incrementing the page number this way until, at some point, the response for a page is null, and we break out of the infinite loop. Now that we have scraped all the data, we move on to the next link in the stack, and so on.
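A minimal sketch of that pagination loop, assuming Node 18+'s global fetch; the items field on the response is a hypothetical name, since the real shape of Mango's JSON may differ:

    // Sketch: walk pageNum upward until a page comes back empty
    async function scrapeAllPages(interceptedUrl) {
      const all = [];
      for (let pageNum = 1; ; pageNum++) {
        // Rewrite the pageNum parameter of the intercepted URL on each iteration
        const url = interceptedUrl.replace(/pageNum=\d+/, `pageNum=${pageNum}`);
        const res = await fetch(url);
        const data = await res.json();
        if (!data || !data.items || data.items.length === 0) break; // null response: done
        all.push(...data.items); // "items" is an assumed field name
      }
      return all;
    }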

Everything is asynchronous, and all the race conditions have been handled.

Since all the resources are lazy-loaded, scraping the data from the website's DOM is inefficient: there is an array of images per product, and they only get loaded into the DOM when you hover over them. The approach above is the efficient solution.

However, I've written the algorithm for hovering and getting the data from the DOM as well:

http://localhost:3472/shopmango/array/men?type=clothing&ct=coats

It does an infinite scroll to trigger and load all the lazy-loaded data, and then performs an automated 300 ms hover on each product to load in its data.

But it is slow compared to the previous algorithm.
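A sketch of that scroll-and-hover routine, again assuming Puppeteer; the .product selector is an assumption for illustration:

    // Sketch: infinite scroll, then hover each product to force its images into the DOM
    async function scrapeViaDom(page) {
      // Scroll until the page height stops growing, so all lazy-loaded products appear
      let previousHeight = 0;
      while (true) {
        const height = await page.evaluate(() => document.body.scrollHeight);
        if (height === previousHeight) break;
        previousHeight = height;
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await new Promise(resolve => setTimeout(resolve, 500)); // let new products load
      }

      // Hover ~300 ms over each product so its image array gets injected into the DOM
      const products = await page.$$('.product'); // selector is an assumption
      for (const product of products) {
        await product.hover();
        await new Promise(resolve => setTimeout(resolve, 300));
      }
    }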

http://localhost:3472/george/women?type=clothing gives all women's clothing

http://localhost:3472/george/men?type=clothing&ct=jeans

Similar to the first one. Here, tracking the vulnerability was easier, as it followed a pattern:

https://direct.asda.com/on/demandware.store/Sites-ASDA-Site/default/SearchDEP-Show?cgid=${ctid}&start=0&sz

ctid is the category ID, which is in the URL of each page, so retrieving it and plugging it into the link gives all the items for the category, including the ones that are lazy-loaded. For example, Dresses:

https://direct.asda.com/george/women/dresses/D1M1G20C1,default,sc.html (the original page)

Extract the category ID, and voilà:

https://direct.asda.com/on/demandware.store/Sites-ASDA-Site/default/SearchDEP-Show?cgid=D1M1G20C1&start=0&sz

Easy! Doing this in a loop for all the pages needed finishes the job.
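A sketch of that loop; the cgid-extracting regex and the sz page-size value are assumptions for illustration (the link above leaves sz unspecified):

    // Sketch: derive cgid from a George page URL and page through the results
    async function scrapeGeorgeCategory(pageUrl) {
      // e.g. .../george/women/dresses/D1M1G20C1,default,sc.html -> D1M1G20C1
      const cgid = pageUrl.match(/\/([^\/,]+),default,sc\.html/)[1];

      const pages = [];
      const pageSize = 60; // assumed value for the sz parameter
      for (let start = 0; ; start += pageSize) {
        const url = `https://direct.asda.com/on/demandware.store/Sites-ASDA-Site/default/SearchDEP-Show?cgid=${cgid}&start=${start}&sz=${pageSize}`;
        const res = await fetch(url);
        const body = await res.text();
        if (!body.trim()) break; // empty page: we've run out of items
        pages.push(body); // ...product parsing would happen here...
      }
      return pages;
    }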

The API can be accessed in a browser, in Postman, or via plain HTTP requests.

Postman might be unable to beautify the JSON to make it readable; in that case, download the JSON from Postman, paste it into your favourite editor, and use your beautify plugin, or use any online beautifier.

HTTP requests and browsers are completely fine too; just make sure to parse the responses so you can view them nicely!
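For instance, a small sketch that fetches a response and pretty-prints it, assuming Node 18+ with global fetch:

    // Fetch a response and pretty-print the JSON without Postman
    (async () => {
      const res = await fetch('http://localhost:3472/george/men?type=clothing&ct=jeans');
      const data = await res.json();
      console.log(JSON.stringify(data, null, 2)); // 2-space indent for readability
    })();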
