taskmadstreetden v1.0.1
Sample scraped files for the men's and women's clothing sections of both sites (Shop Mango and George) are included in the directory.
Install the package:

npm i taskmadstreetden

(npm is Node's package manager, analogous to pip in Python; this module is published in the npm registry.)

Once installed you will have all the files in place. Let's get started!
To run the REST API server, all you need to do is run node app.js from the root directory of the project in a terminal. This initializes the Express server.
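For reference, here is a minimal sketch of the shape of such an Express entry point. The route wiring and handler body are assumptions for illustration, not the package's actual code; the port comes from the example URLs below.

```javascript
// Sketch of an Express entry point like app.js (assumed wiring, not the real code).
const PORT = 3472; // port used by the example URLs in this README

function createServer() {
  const express = require('express'); // loaded lazily inside the function
  const app = express();

  // One catch-all route per site/gender; the real app dispatches to the scrapers.
  app.get('/:site/:gender', (req, res) => {
    res.json({ site: req.params.site, gender: req.params.gender, query: req.query });
  });

  return app;
}

// createServer().listen(PORT);  // what `node app.js` effectively does
```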
Women's categories: Coats, Jackets, Suits, Dresses, Jumpsuits, Cardigans and sweaters, Shirts, T-shirts and tops, Trousers, Jeans, Skirts
Men's categories: Coats, Jackets, Blazers, Suits, Cardigans and sweaters, Sweatshirts, Shirts, T-shirts, Trousers, Jeans, Underwear
http://localhost:3472/shopmango/women?type=clothing&ct=Jumpsuits,jackets,suits&browser=true
By default a Chromium browser is opened so you can watch the automation; you can turn this off by setting browser=false.
The same applies for men: change women to men in the URL and use the corresponding men's categories.
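If you are calling the API from code, a tiny helper (hypothetical, not part of the package) can assemble these URLs:

```javascript
// Build an API URL like the examples above from structured options.
// This helper is an illustration only; it is not shipped with the package.
function buildApiUrl({ site, gender, type = 'clothing', ct = [], browser }) {
  let url = `http://localhost:3472/${site}/${gender}?type=${type}`;
  if (ct.length) url += `&ct=${ct.join(',')}`; // comma-separated categories
  if (browser !== undefined) url += `&browser=${browser}`;
  return url;
}

console.log(buildApiUrl({
  site: 'shopmango',
  gender: 'women',
  ct: ['Jumpsuits', 'jackets', 'suits'],
  browser: true,
}));
// -> http://localhost:3472/shopmango/women?type=clothing&ct=Jumpsuits,jackets,suits&browser=true
```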
The scraper then eavesdrops on the network traffic going back and forth and identifies the XHR request that carries the catalog URL. The ids and strings in that URL change here and there, so eavesdropping on live requests is the reliable way to capture it.
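That eavesdropping step could be sketched with Puppeteer's request events. The library choice and wiring here are assumptions about how the automation is set up; the URL pattern is taken from the catalog link shown below.

```javascript
// Detect the catalog XHR while the page loads (sketch, assumed wiring).

// Pure predicate: does this request URL look like the Mango catalog endpoint?
function isCatalogRequest(url) {
  return url.includes('/services/cataloglist/filtersProducts/');
}

async function findCatalogUrl(pageUrl) {
  const puppeteer = require('puppeteer'); // required lazily
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Resolve as soon as a matching request goes out.
  const hit = new Promise((resolve) => {
    page.on('request', (req) => {
      if (isCatalogRequest(req.url())) resolve(req.url());
    });
  });

  await page.goto(pageUrl);
  const url = await hit;
  await browser.close();
  return url;
}
```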
Once that request is detected, the scraper enters a loop. The link is transformed into
https://shop.mango.com/services/cataloglist/filtersProducts/IN/he/sections_he.prendas_he/?idSubSection=abrigos_he&menu=familia;106&pageNum=1
which returns the maximum number of items for that page number; an awaited AJAX request fetches the JSON, which is then processed.
The same is done for page 2 (click to see the response that gets processed): https://shop.mango.com/services/cataloglist/filtersProducts/IN/he/sections_he.prendas_he/?idSubSection=abrigos_he&menu=familia;106&pageNum=2
The page number keeps incrementing until the response for some page comes back null, at which point the loop breaks. With all the data for that category scraped, the scraper moves on to the next link on the stack, and so on.
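The pagination loop described above can be sketched like this. The shape of the response body (a products array) is an assumption; the real JSON may nest things differently.

```javascript
// Append the page number to the catalog URL captured earlier.
function buildPageUrl(base, pageNum) {
  return `${base}&pageNum=${pageNum}`;
}

// Keep fetching pages until the API returns nothing (Node 18+ global fetch).
async function scrapeAllPages(base) {
  const items = [];
  for (let pageNum = 1; ; pageNum++) {
    const res = await fetch(buildPageUrl(base, pageNum));
    const json = await res.json();
    // "products" is an assumed field name; the loop stops on a null/empty page.
    if (!json || !json.products || json.products.length === 0) break;
    items.push(...json.products);
  }
  return items;
}
```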
Everything is asynchronous, and race conditions are handled throughout.
Because all the resources are lazy loaded, scraping the data from the website's DOM is inefficient: each product has an array of images, and they only get loaded into the DOM when you hover over the product. The XHR-based approach above is the efficient solution.
However, I have written the algorithm for hovering and reading the data from the DOM as well:
http://localhost:3472/shopmango/array/men?type=clothing&ct=coats
It performs an infinite scroll to trigger all the lazy-loaded content, then hovers over each product for 300 ms to load its data into the DOM.
It works, but it is slow compared to the XHR-based algorithm.
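That DOM-based routine might look roughly like this. Apart from the 300 ms hover, the selectors, timings, and Puppeteer wiring are assumptions for illustration.

```javascript
// Sketch of the DOM fallback: scroll to load everything, hover each
// product so its lazy image array enters the DOM, then read it out.
const HOVER_MS = 300; // hover duration from the description above
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeViaDom(pageUrl) {
  const puppeteer = require('puppeteer'); // required lazily
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(pageUrl, { waitUntil: 'networkidle2' });

  // Infinite scroll: keep scrolling until the page height stops growing.
  let prevHeight = 0;
  for (;;) {
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === prevHeight) break;
    prevHeight = height;
    await page.evaluate((h) => window.scrollTo(0, h), height);
    await delay(500);
  }

  // Hover each product card so its images load ('.product' is a guessed selector).
  const cards = await page.$$('.product');
  for (const card of cards) {
    await card.hover();
    await delay(HOVER_MS);
  }

  // Collect the now-populated image URLs from the DOM.
  const data = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.product img'), (img) => img.src)
  );
  await browser.close();
  return data;
}
```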
http://localhost:3472/george/women?type=clothing gives all women's clothing
http://localhost:3472/george/men?type=clothing&ct=jeans
Similar to the first site, but here tracking down the underlying request was easier because it follows a fixed pattern: https://direct.asda.com/on/demandware.store/Sites-ASDA-Site/default/SearchDEP-Show?cgid=${ctid}&start=0&sz
ctid is the category id that appears in the URL of each category page; retrieving it and substituting it into the link returns all the items for that category, including the lazy-loaded ones. For example, Dresses:
https://direct.asda.com/george/women/dresses/D1M1G20C1,default,sc.html (the original page)
Grab the category id from that URL and voila!
Easy! Doing this in a loop for all the pages needed finishes the job.
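Extracting the ctid and building the request can be sketched as below. The sz value is an assumption (the README leaves it truncated); the extraction pattern is based on the example page URL above.

```javascript
// Pull the category id (ctid) out of a George category-page URL.
function extractCtid(pageUrl) {
  // e.g. .../women/dresses/D1M1G20C1,default,sc.html -> D1M1G20C1
  const m = pageUrl.match(/\/([^/,]+),default,sc\.html/);
  return m ? m[1] : null;
}

// Build the SearchDEP-Show request; the sz (page size) value is assumed.
function buildGeorgeUrl(ctid, size = 1000) {
  return `https://direct.asda.com/on/demandware.store/Sites-ASDA-Site/default/SearchDEP-Show?cgid=${ctid}&start=0&sz=${size}`;
}

console.log(extractCtid('https://direct.asda.com/george/women/dresses/D1M1G20C1,default,sc.html'));
// -> D1M1G20C1
```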
The API can be accessed from a browser, from Postman, or via plain HTTP requests.
Postman may be unable to beautify the JSON to make it readable; if so, download the JSON from Postman, paste it into your favourite editor, and use a beautify plugin (or any online beautifier).
Browsers and plain HTTP requests are completely fine too; just make sure to parse the responses so they display nicely.
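If your client shows the JSON on one line, re-serializing it with indentation in plain Node is enough, no plugins needed:

```javascript
// Parse a raw JSON string and re-serialize it with 2-space indentation.
function beautify(jsonText) {
  return JSON.stringify(JSON.parse(jsonText), null, 2);
}

console.log(beautify('{"site":"george","items":[1,2]}'));
```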